Data Science with Spark

TRANSCRIPT

Page 1: Data Science with Spark

In Apache Spark

Foundations of Data Science with Spark

July 16, 2015

@ksankar // doubleclix.wordpress.com

Page 2: Data Science with Spark

www.globalbigdataconference.com

Twitter : @bigdataconf

Page 3: Data Science with Spark

o Intro & Setup [8:00-8:20)
  • Goals/non-goals
o Spark & Data Science DevOps [8:20-8:40)
  • Spark in the context of Data Science
o Where exactly is Apache Spark headed? [8:40-9:30)
  • Spark Yesterday, Today & Tomorrow
  • Spark Stack
o Break [9:30-10:00)
o DataFrames for the Data Scientist [10:00-11:30)
  • pySpark classes
  • Walkthru DataFrames
  • Hands-on Notebooks
o [15] Discussions/Slack (11:30-11:45)

Agenda: Introduction To Spark
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html

Page 4: Data Science with Spark

o Review (2:00-2:30)
  • 004-Orders-Homework-Solution
  • MLlib Statistical Toolbox
  • Summary, Correlations
o [20] Linear Regression (2:30-2:45)
o [20] “Mood Of the Union” (2:45-3:15)
  • State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
  • Map-reduce, parse text
o Break (3:15-3:30)
o [60] Predicting Survivors with Classification (3:30-4:30)
  • Decision Trees
  • Naive Bayes (Titanic data set)
o Break (4:30-4:45)
o [20] Clustering (4:45-5:05)
  • K-means for GallacticHoppers!
o [20] Recommendation Engine (5:05-5:25)
  • Collaborative filtering w/ MovieLens
o [15] Discussions/Slack (5:45-6:00)

Agenda: Data Wrangling w/ DataFrames & MLlib
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html

Page 5: Data Science with Spark

Goals & non-goals

Goals
  ¤ Understand how to program Machine Learning with Spark & Python
  ¤ Focus on programming & ML application
  ¤ Give you focused time to work thru the examples
    § Work with me. I will wait if you want to catch up.
  ¤ Less theory, more usage: let us see if this works
  ¤ As straightforward as possible
    § The programs can be optimized

Non-goals
  ¡ Go deep into the algorithms
    • We don’t have sufficient time. The topic could easily be a 5-day tutorial!
  ¡ Dive into Spark internals
    • That is for another day
  ¡ The underlying computation, communication, constraints & distribution is a fascinating subject
    • Paco does a good job explaining them
  ¡ A passive talk
    • Nope. Interactive & hands-on.

Page 6: Data Science with Spark

About Me
o Data Scientist
  • Decision Data Science & Product Data Science
  • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
  • Predicting NFL with Elo like Nate Silver & 538 [NFL: http://goo.gl/Q2OgeJ, NBA ’15: https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Full-day Spark workshop “Advanced Data Science w/ Spark” at Spark Summit East ’15 [https://goo.gl/7SBKTC]
o Co-author: “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer: “Machine Learning with Spark”, Packt Publishing
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech), starting MS-CFRM, University of WA
  • Written books (Web 2.0, Wireless, Java, …), standards, some work in AI
  • Guest lecturer at Naval Postgraduate School, …
o Volunteer as a Robotics Judge at FIRST Lego League World Competitions
o @ksankar, doubleclix.wordpress.com

Page 7: Data Science with Spark

Close Encounters
◦ 1st: This Tutorial
◦ 2nd: Do more hands-on walkthroughs
◦ 3rd: Listen to lectures, more competitions …

Page 8: Data Science with Spark

Spark Installation
o Install Spark 1.4.1 on your local machine
  • https://spark.apache.org/downloads.html
  • Pre-built for Hadoop 2.6 is fine
  • Download & uncompress
  • Remember the path & use it wherever you see /usr/local/spark/
  • I have downloaded it into /usr/local & have a softlink spark to the latest version
o Install IPython

Page 9: Data Science with Spark

Tutorial Materials
o Github: https://github.com/xsankar/global-bd-conf
  • Clone or download the zip
o Open a terminal
o cd ~/global-bd-conf
o IPYTHON=1 IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.0.3
o Notes:
  • I have a soft link “spark” in my /usr/local that points to the Spark version that I use. For example: ln -s spark-1.4.1/ spark
o Click on the IPython dashboard
o Run 000-PreFlightCheck.ipynb
o Run 001-TestSparkCSV.ipynb
o Now you are ready for the workshop!

Page 10: Data Science with Spark

Spark & Data Science DevOps

8:20

Page 11: Data Science with Spark

Spark in the context of data science

Page 12: Data Science with Spark

Data Science: The art of building a model with known knowns, which, when let loose, works with unknown unknowns!

Donald Rumsfeld is an armchair Data Scientist!

http://smartorg.com/2013/07/valuepoint19/

[2x2 diagram: The World (Knowns/Unknowns) vs. You (Known/Unknown)]

o Known Knowns
  • There are things we know that we know
  • What we do
o Known Unknowns
  • That is to say, there are things that we now know we don't know
  • Potential facts and outcomes we are aware of, but not with certainty
  • Stochastic processes, probabilities
o Unknown Knowns
  • Others know, you don’t
o Unknown Unknowns
  • There are things we do not know we don't know
  • Facts, outcomes or scenarios we have not encountered, nor considered
  • “Black swans”, outliers, long tails of probability distributions
  • Lack of experience, imagination

Page 13: Data Science with Spark

The curious case of the Data Scientist
o Data Scientist is multi-faceted & contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story

http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/

“Large is hard; Infinite is much easier!” – Titus Brown

Page 14: Data Science with Spark

Data Science - Context

Data Management pipeline: Collect → Store → Transform

Collect
  o Volume
  o Velocity
  o Streaming data

Store
  o Canonical form
  o Data catalog
  o Data fabric across the organization
  o Access to multiple sources of data
  o Think hybrid: Big Data apps, appliances & infrastructure

Transform
  o Metadata
  o Monitor counters & metrics
  o Structured vs. multi-structured

Data Science pipeline: Reason → Model → Deploy

Reason
  o Flexible & selectable
    § Data subsets
    § Attribute sets

Model
  o Refine the model with
    § Extended data subsets
    § Engineered attribute sets
  o Validation run across a larger data set

Deploy
  o Scalable model deployment
  o Big Data automation & purpose-built appliances (soft/hard)
  o Manage SLAs & response times

Explore / Visualize / Recommend / Predict

Explore
  o Dynamic data sets
  o 2-way key-value tagging of datasets
  o Extended attribute sets
  o Advanced analytics

Visualize
  o Advanced visualization
  o Interactive dashboards
  o Map overlay
  o Infographics

Recommend / Predict
  o Performance
  o Scalability
  o Refresh latency
  o In-memory analytics

¤ Bytes to Business a.k.a. build the full stack
¤ Find relevant data for the business
¤ Connect the dots

Page 15: Data Science with Spark

Volume • Velocity • Variety
Context • Connectedness
Intelligence • Interface • Inference

“Data of unusual size” that can't be brute-forced

o The Three Amigos
  • Interface = Cognition
  • Intelligence = Compute (CPU) & Computational (GPU)
  • Infer significance & causality

Page 16: Data Science with Spark

Day in the life of a (super) Model

[Diagram: Intelligence, Inference, Data Representation & Interface surrounding the model lifecycle: Algorithms, Parameters, Attributes; Data (Scoring); Feature Selection & Dimensionality Reduction; Model Selection; Reason & Learn; Models; Model Assessment; Visualize, Recommend, Explore]

Page 17: Data Science with Spark

Data Science Maturity Model & Spark

Isolated Analytics
  • Data: small data
  • Context: local
  • Model, Reason & Deploy: single set of boxes, usually owned by the model builders; departmental
  • Interface: reports
  • Type: descriptive & reactive
  • Datasets: all in the same box
  • Workload: skunk works
  • Strategy: informal definitions

Integrated Analytics
  • Data: larger data sets
  • Context: domain
  • Model, Reason & Deploy: deploy on central analytics infrastructure; models still owned & operated by modelers; partly enterprise-wide
  • Interface: dashboards
  • Type: + predictive
  • Datasets: fixed data sets, usually in temp data spaces
  • Workload: business-relevant apps with approximate SLAs
  • Strategy: data definitions buried in the analytics models

Aggregated Analytics
  • Data: Big Data
  • Context: cross-domain + external
  • Model, Reason & Deploy: central analytics infrastructure; model & reason by model builders; deploy & operate by ops; residuals and other metrics monitored by modelers; enterprise-wide
  • Interface: dashboards + some APIs
  • Type: + adaptive
  • Datasets: flexible data & attribute sets
  • Workload: high-performance appliance clusters
  • Strategy: some data definitions

Automated Analytics
  • Data: Big Data factory model
  • Context: cross-domain + external
  • Model, Reason & Deploy: distributed analytics infrastructure; AI-augmented models; model & reason by model builders; deploy & operate by ops; data as a monetized service, extending to ecosystem partners
  • Interface: dashboards + well-defined APIs + programming models
  • Type: adaptive
  • Datasets: dynamic datasets with well-defined refresh policies
  • Workload: appliances and clusters for multiple workloads, including real-time apps; infrastructure for emerging technologies
  • Strategy: data catalogue, metadata & annotations; Big Data MDM strategy

Page 18: Data Science with Spark

The Sense & Sensibility of a Data Scientist DevOps

Factory = Operational
Lab = Investigative

http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

Page 19: Data Science with Spark

Where exactly is Apache Spark headed?

Spark Yesterday, Today & Tomorrow …

“Unified engine across diverse data sources, workloads & environments”

8:40

Page 20: Data Science with Spark

http://free-stock-illustration.com/winding+road+clip+art

Spark Yesterday, Today & Tomorrow …

Spark 1.x
• Fast engine for big data processing
• Fast to run code & fast to write code
• In-memory computation graphs, compatibility with the Hadoop ecosystem and interesting, very usable APIs
• Iterative & interactive apps that operate on data multiple times, which are not a good use case for Hadoop

Spark 1.3 & beyond has been the catalyst for a renaissance in Data Science!

Spark 1.4+
• Multi-pass analytics: ML pipelines, GraphX
• Ad-hoc queries: DataFrames
• Real-time stream processing: Spark Streaming
• Parallel machine learning algorithms beyond the basic RDDs
• More types of data sources as input & output
• More integration with R to span statistical computing beyond “single-node tools”
• More integration with apps like visualization dashboards
• More performance with even larger datasets & complex applications: Project Tungsten

Page 21: Data Science with Spark

Spark Directions

Data Science
  o DataFrames
  o ML Pipelines
  o SparkR

Platform APIs
  o Growing the ecosystem
    § Data sources: uniform access to diverse data sources
    § Pluggable “smart” DataSource API for reading/writing DataFrames while minimizing I/O
    § Spark Packages
    § Deployment utilities for Google Compute, Azure & Job Server

Execution Optimization (Project Tungsten)
  o Focus on CPU efficiency
  o Run-time code generation
  o Cache locality & cache-aware data structures
  o Binary format for aggregations
  o Spark-managed memory
  o Off-heap memory management

Streaming, DAG Visualization & Debugging
  o Spark Streaming flow control & optimized state management

Page 22: Data Science with Spark

Spark: The (simple) Stack

Page 23: Data Science with Spark

RDD – The workhorse of Core Spark

o Resilient Distributed Datasets
  • A collection that can be operated on in parallel
o Transformations create RDDs
  • map, filter, …
o Actions get values
  • collect, take, …
o We will apply these operations during this tutorial
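The slide's examples are screenshots that did not survive the transcript; here is a minimal pySpark sketch of the transformation/action split, assuming the sc SparkContext that the pyspark shell provides (the numbers are illustrative only):

    # Transformations are lazy: nothing executes until an action is called
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    squares = rdd.map(lambda x: x * x)              # transformation -> new RDD
    evens = squares.filter(lambda x: x % 2 == 0)    # transformation -> new RDD
    print(evens.collect())                          # action -> [4, 16]
    print(squares.take(2))                          # action -> [1, 4]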

Page 24: Data Science with Spark

[Stack diagram]
• Language bindings: R, Scala, Java, Python
• DataFrame API
• ML Pipelines & Advanced Analytics: Neural Networks, Deep Learning, Parameter Server
• Spark SQL, Spark Streaming, SparkR, MLlib, GraphX, Packages
• Catalyst Optimizer: optimizes the execution plan
• Tungsten Execution
• Spark Core / RDD
• Data Sources: Parquet, Hadoop, Cassandra, JSON, CSV, JDBC, …

Page 25: Data Science with Spark

Query Optimization-Execution pipeline

SQL Query / DataFrame
  → Unresolved Logical Plan
  → [Analysis, using the Catalog]
  → Logical Plan
  → [Logical Optimization]
  → Optimized Logical Plan
  → [Physical Planning]
  → Physical Plans
  → [Cost Model selects one]
  → Selected Physical Plan
  → [Code Generation]
  → RDDs

Ref: Spark SQL paper

Page 26: Data Science with Spark

Spark DataFrames for the Data Scientist

“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”

- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.

DataFrames! The most massively useful thing a Data Scientist can have …

10:00

Page 27: Data Science with Spark

Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s generalization that counts
  • The fundamental goal of machine learning is to generalize beyond the examples in the training set
o Data alone is not enough
  • Induction, not deduction: every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine learning is not magic; one cannot get something from nothing
  • In order to infer, one needs the knobs & the dials
  • One also needs a rich, expressive dataset

A few useful things to know about machine learning, by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755

Page 28: Data Science with Spark

pyspark
  pyspark.SparkContext()
  pyspark.SparkConf()
  pyspark.RDD()
  pyspark.Broadcast()
  pyspark.Accumulator()
  pyspark.SparkFiles()
  pyspark.StorageLevel()

Submodules: pyspark.sql, pyspark.streaming, pyspark.mllib, pyspark.ml

pyspark.streaming
  pyspark.streaming.StreamingContext()
  pyspark.streaming.DStream()
  pyspark.streaming.kafka
  pyspark.streaming.kafka.Broker()
  …

pyspark.sql
  pyspark.sql.SQLContext()
  pyspark.sql.DataFrame()
  pyspark.sql.DataFrameNaFunctions()
  pyspark.sql.DataFrameStatFunctions()
  pyspark.sql.DataFrameReader()
  pyspark.sql.DataFrameWriter()
  pyspark.sql.Column()
  pyspark.sql.Row()
  pyspark.sql.functions
  pyspark.sql.types
  pyspark.sql.Window()
  pyspark.sql.WindowSpec()
  pyspark.sql.GroupedData()
  pyspark.sql.HiveContext()

pyspark.mllib
  pyspark.mllib.classification
  pyspark.mllib.clustering
  pyspark.mllib.evaluation
  pyspark.mllib.feature
  pyspark.mllib.fpm
  pyspark.mllib.linalg
  pyspark.mllib.random
  pyspark.mllib.recommendation
  pyspark.mllib.regression
  pyspark.mllib.stat
  pyspark.mllib.tree
  pyspark.mllib.util

pyspark.ml (ML Pipeline APIs)
  pyspark.ml.Transformer
  pyspark.ml.Estimator
  pyspark.ml.Model
  pyspark.ml.Pipeline
  pyspark.ml.PipelineModel
  pyspark.ml.param
  pyspark.ml.feature
  pyspark.ml.classification
  pyspark.ml.recommendation
  pyspark.ml.regression
  pyspark.ml.tuning
  pyspark.ml.evaluation

Page 29: Data Science with Spark

1. SparkContext()
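The slide itself is a screenshot; a minimal sketch of creating a SparkContext in a standalone script (in the pyspark shell, sc is created for you; the app name and master here are assumptions):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("global-bd-conf").setMaster("local[*]")
    sc = SparkContext(conf=conf)   # entry point to Spark; one per process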

Page 30: Data Science with Spark

2. Read/Write
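Again the screenshot is lost; a hedged sketch of reading and writing DataFrames with the spark-csv package loaded at startup (the file names are made up):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    # Read a CSV file into a DataFrame via the com.databricks.spark.csv source
    df = sqlContext.read.format('com.databricks.spark.csv') \
                        .options(header='true') \
                        .load('cars.csv')

    # Write the DataFrame back out, e.g. as Parquet
    df.write.parquet('cars.parquet')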

Page 31: Data Science with Spark

3. Convert

Between pyspark.sql.DataFrame, a SQL table and a pandas.DataFrame:

  sqlContext.registerDataFrameAsTable(df, "aTable")
  df2 = sqlContext.table("aTable")

  df = sqlContext.createDataFrame(a_pandas_df)
  p_df = df.toPandas()

Page 32: Data Science with Spark

4. Columns & Rows (1 of 3)
o Select a column
  • by the df("<columnName>") notation or the df.<columnName> notation
  • The recommended way is df("<columnName>"), the reason being that a column name can collide with a DataFrame method if we use df.<columnName>
o Column-wise operations like +, -, *, /, % (modulo), &&, ||, <, <=, > and >=
  • df("total") = df("price") * df("qty")
  • The inequality operator is !==, the usual equality operator is ===, and <=> is an equality test that is safe for null values
o Meta operations: type conversion (cast), alias, not null, …
  • df_cars.mpg.cast("double").alias('mpg')
o Run arbitrary UDFs on a column (see next page)

Page 33: Data Science with Spark

4. Columns & Rows (2 of 3)

Run arbitrary UDFs on a column
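The walkthrough cell is a screenshot; a hedged sketch of a column UDF in the 1.4-era API (the mpg-to-km/l conversion and the df_cars column names are invented for illustration):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Wrap a plain Python function as a column-level UDF with an explicit return type
    mpg_to_kmpl = udf(lambda mpg: mpg * 0.425144, DoubleType())
    df_cars = df_cars.withColumn('kmpl', mpg_to_kmpl(df_cars['mpg']))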

Page 34: Data Science with Spark

4. Columns & Rows (3 of 3)

Interesting Operations … Adding a column …
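For the "adding a column" operation shown in the screenshot, a plain column-arithmetic sketch (the column names are assumptions):

    # withColumn returns a new DataFrame with the derived column appended
    df_orders = df_orders.withColumn('total', df_orders['UnitPrice'] * df_orders['Quantity'])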

Page 35: Data Science with Spark

5. DataFrame: RDD-like Operations

Function: df.sort(<sort expression>) or df.orderBy(<sort expression>)
Description: Returns a sorted DataFrame. There are multiple ways of specifying the sort expression. Use of orderBy is recommended (as the syntax is closer to SQL), for example:
  df_orders_1.groupBy("CustomerID", "Year").count().orderBy('count', ascending=False).show()

Function: df.filter(<condition>) or df.where(<condition>)
Description: Returns a new DataFrame after applying the <condition>. The condition is usually based on a column. Use of the where form is recommended (as the syntax is closer to the SQL world), for example:
  df_orders.where(df_orders['ShipCountry'] == 'France')

Function: df.coalesce(n)
Description: Returns a DataFrame with n partitions, same as the coalesce(n) method of RDD.

Function: df.foreach(<function>)
Description: Applies a function to all the rows of a DataFrame.

Function: df.map(lambda r: …)
Description: Applies the function to all the rows and returns the resulting objects.

Function: df.flatMap(lambda r: …)
Description: Returns an RDD, flattened, after applying the function to all the rows of the DataFrame.

Function: df.rdd
Description: Returns the DataFrame as an RDD of Row objects.

Function: df.na.replace([<values to be replaced>], [<replacement values>], subset=[<columns>]), also DataFrame.replace() or DataFrameNaFunctions.replace()
Description: An interesting function, very useful and a little strange syntax-wise. The recommended form is df.na.replace(), even though the .na namespace throws it a little bit. Use subset= for the column names. The syntax is different from the Scala syntax.

Page 36: Data Science with Spark

6. DataFrame: Actions
o cache()
o collect(), collectAsList()
o count()
o describe()
o first(), head(), show(), take()
o …
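A short sketch of these actions in use (a hedged example, not the deck's notebook):

    df.cache()            # mark the DataFrame for in-memory reuse
    print(df.count())     # number of rows
    df.describe().show()  # summary statistics for the numeric columns
    df.show(5)            # print the first 5 rows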

Page 37: Data Science with Spark

7. DataFrame: Scientific Functions

Page 38: Data Science with Spark

8. DataFrame: Statistical Functions

The pair-wise frequency (contingency table) of transmission type vs. number of speeds shows interesting observations:
• All automatic cars in the dataset are 3-speed, while most of the manual-transmission cars have 4 or 5 speeds
• Almost all the manual cars have 2 barrels, while the automatic cars have 2 and 4 barrels
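The table itself is a screenshot; a hedged sketch of how such a contingency table is produced with the 1.4 stat functions (mtcars-style column names are assumptions):

    # Pair-wise frequency of transmission type vs. number of forward gears
    df_cars.stat.crosstab('am', 'gear').show()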

Page 39: Data Science with Spark

9. DataFrame: Aggregate Functions
o The pyspark.sql.functions module (and org.apache.spark.sql.functions for Scala) contains the aggregation functions
o There are two types of aggregations: one over column values, and the other over subsets of column values, i.e. values grouped by some other columns
  • pyspark.sql.functions.avg("sales")
  • df.groupBy("year").agg({"sales": "avg"})
o count(), countDistinct()
o first(), last()
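A hedged sketch of both aggregation styles (the 'sales', 'year' and 'CustomerID' columns are assumptions):

    from pyspark.sql import functions as F

    # Aggregate over a whole column
    df.agg(F.avg('sales')).show()

    # Aggregate over grouped values
    df.groupBy('year').agg({'sales': 'avg'}).show()
    df.groupBy('year').agg(F.avg('sales'), F.countDistinct('CustomerID')).show()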

Page 40: Data Science with Spark

10. DataFrame: na
o One of the tenets of big data and data science is that data is never fully clean. While we can handle types, formats et al, missing values are always challenging.
o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values.
o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success; it is better than dropping the rows.
  • We can replace null with 0
  • A better solution is to replace numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy
  • We could use the mode or median instead of the mean
  • Another good strategy is to infer the missing value from other attributes, i.e. “evidence from multiple fields”
    • For example, the Titanic data has a name field; for imputing the missing age field, we could use the Mr., Master., Mrs., Miss designation from the name and then fill in the average of the age field for the corresponding designation. So a row with a missing age and “Master.” in the name would get the average age of all records with “Master.”
    • There are also fields for the number of siblings and the number of spouses. We could average the age based on the values of those fields.
    • We could even average the ages from the different strategies.

Page 41: Data Science with Spark

10. DataFrame: na
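The screenshot is gone; a hedged sketch of the DataFrameNaFunctions namespace on Titanic-style columns (the column names and fill values are assumptions):

    df_nodrop = df.na.drop()                # drop any row containing a null
    df_filled = df.na.fill({'age': 29.7})   # impute a numeric column, e.g. with its mean
    df_recoded = df.na.replace(['?'], ['unknown'], subset=['embarked'])  # recode sentinels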

Page 42: Data Science with Spark

11. Joins/Set Operations a.k.a. Language-Integrated Queries
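A hedged sketch of the join and set operations (the DataFrame names are assumptions; unionAll is the 1.4-era spelling):

    joined = df_orders.join(df_customers,
                            df_orders['CustomerID'] == df_customers['CustomerID'],
                            'inner')
    common = df_a.intersect(df_b)    # rows present in both
    combined = df_a.unionAll(df_b)   # rows in either (keeps duplicates)
    only_a = df_a.subtract(df_b)     # rows only in df_a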

Page 43: Data Science with Spark

12. SQL on Tables
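A hedged sketch of running SQL against a registered DataFrame (the table and column names are assumptions):

    sqlContext.registerDataFrameAsTable(df_orders, 'orders')
    top = sqlContext.sql(
        "SELECT CustomerID, COUNT(*) AS n FROM orders "
        "GROUP BY CustomerID ORDER BY n DESC")
    top.show(10)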

Page 44: Data Science with Spark

Hands-On
o 003-DataFrame-For-DS
  • Understand and run the iPython notebook
o 004-Orders
  • Homework; we will go through the solution when we meet in the afternoon

Page 45: Data Science with Spark

Data Wrangling with Spark

2:00

Page 46: Data Science with Spark

Algorithm spectrum: Machine Learning → Cute Math → Artificial Intelligence

Machine Learning
  o Regression
  o Logit
  o CART
  o Ensemble: Random Forest
  o Clustering
  o kNN
  o Genetic Algorithms
  o Simulated Annealing
  o Collaborative Filtering

Cute Math
  o SVM
  o Kernels
  o SVD

Artificial Intelligence
  o Neural Nets
  o Boltzmann Machine
  o Feature Learning

Page 47: Data Science with Spark

Statistical Toolbox
o Sample data: car mileage data
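The sample output is a screenshot; a hedged sketch of the MLlib statistical toolbox on the car-mileage data (the RDD names are assumptions):

    from pyspark.mllib.stat import Statistics

    summary = Statistics.colStats(car_vectors)   # RDD of per-car feature vectors
    print(summary.mean(), summary.variance())

    # Pearson correlation between two numeric columns, e.g. mpg vs. horsepower
    print(Statistics.corr(mpg_rdd, hp_rdd, method='pearson'))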

Page 48: Data Science with Spark
Page 49: Data Science with Spark

Linear Regression

2:30

Page 50: Data Science with Spark

Linear Regression - API

LabeledPoint: the features and label of a data point
LinearModel: weights, intercept
LinearRegressionModelBase: predict()
LinearRegressionModel
LinearRegressionWithSGD:
  train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
LassoModel: least-squares fit with an l_1 penalty term
LassoWithSGD:
  train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
RidgeRegressionModel: least-squares fit with an l_2 penalty term
RidgeRegressionWithSGD:
  train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)

Page 51: Data Science with Spark

Basic Linear Regression
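The notebook cell is a screenshot; a hedged sketch of the MLlib flow (the feature layout, label = mpg and feature = horsepower, is an assumption):

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    data = car_rdd.map(lambda row: LabeledPoint(row[0], [row[1]]))
    model = LinearRegressionWithSGD.train(data, iterations=100, step=0.0001)
    print(model.weights, model.intercept)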

Page 52: Data Science with Spark

Use LR model for prediction & calculate MSE
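A hedged sketch of the prediction/MSE step over the same LabeledPoint RDD:

    # Pair each actual label with the model's prediction, then average the squared error
    preds = data.map(lambda p: (p.label, model.predict(p.features)))
    mse = preds.map(lambda t: (t[0] - t[1]) ** 2).mean()
    print("MSE = %.4f" % mse)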

Page 53: Data Science with Spark

Step size is important; the model can diverge!

Page 54: Data Science with Spark

Interesting step size

Page 55: Data Science with Spark
Page 56: Data Science with Spark
Page 57: Data Science with Spark

“Mood Of the Union” with TF-IDF

2:45

Page 58: Data Science with Spark

Scenario – Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country.
o If so, can we infer the mood of the country by analyzing the SOTU?
o If we embark on this line of thought, how would we do it with Spark & Python?
o Is it different from Hadoop MapReduce?
o Is it better?

Page 59: Data Science with Spark

POA (Plan Of Action)
o Collect the State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTUs from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word-frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the times
o Compute set differences and see how new words have cropped up
o Compute TF-IDF (homework!)
Page 60: Data Science with Spark

Lookout for these interesting Spark features
o RDD map-reduce
o How to parse input
o Removing common words
o Sorting an RDD by value

Page 61: Data Science with Spark

Read & Create word vector

iPython notebook at https://github.com/xsankar/cloaked-ironman
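The notebook cell is a screenshot; a hedged sketch of the read / word-count / sort-by-value steps (the file name is an assumption):

    import re

    lines = sc.textFile('sotu/2014-obama.txt')
    words = lines.flatMap(lambda line: re.split(r'\W+', line.lower())) \
                 .filter(lambda w: len(w) > 0)
    # Classic map-reduce word count, then sort the (word, count) RDD by value
    freq = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    top25 = freq.sortBy(lambda kv: kv[1], ascending=False).take(25)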

Page 62: Data Science with Spark

Remove Common Words – 1 of 3

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 63: Data Science with Spark

Remove Common Words – 2 of 3

Page 64: Data Science with Spark

Remove Common Words – 3 of 3
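The three screenshot cells boil down to a set difference against a stop-word list; a hedged sketch (the stop-word list here is abbreviated):

    # Key the stop words, then subtractByKey to drop them from the frequency RDD
    stop = sc.parallelize(['the', 'of', 'and', 'to', 'in', 'a']).map(lambda w: (w, 1))
    content_words = freq.subtractByKey(stop)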

Page 65: Data Science with Spark

FDR vs. Barack Obama as reflected by SOTU

Page 66: Data Science with Spark

Barack Obama vs. Bill Clinton

Page 67: Data Science with Spark

GWB vs. Abe Lincoln as reflected by SOTU

Page 68: Data Science with Spark

Epilogue
o Interesting exercise
o Highlights
  • Map-reduce in a couple of lines!
  • But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen¹)
  • Set differences using subtractByKey
  • Ability to sort a map by values (or any arbitrary function, for that matter)
o To explore as homework:
  • TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

¹ http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

Page 69: Data Science with Spark

Break

3:15

Page 70: Data Science with Spark

Predicting Survivors with Classification

3:30

Page 71: Data Science with Spark

Data Science “folk knowledge” (Wisdom of Kaggle): Jeremy’s Axioms
o Iteratively explore data
o Tools
  • Excel format, Perl, Perl book, Spark!
o Get your head around the data
  • Pivot tables
o Don’t over-complicate
o If people give you data, don’t assume that you need to use all of it
o Look at pictures!
o History of your submissions: keep a tab
o Don’t be afraid to submit simple solutions
  • We will do this during this workshop

Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/

Page 72: Data Science with Spark

Classification - Scenario

Titanic Passenger Metadata
• Small
• 3 predictors: class, sex, age
• Label: survived?

o This is a knowledge exercise
o Classify survival from the Titanic data
o Gives us a quick dataset to run & test classification

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 73: Data Science with Spark

Classifying Classifiers

Statistical
  • Regression
  • Logistic Regression¹
  • Naïve Bayes
  • Bayesian Networks
  • SVM

Structural
  • Rule-based
    ◦ Production rules
    ◦ Decision trees
  • Distance-based
    ◦ Functional: linear, spectral, wavelet
    ◦ Nearest neighbor: kNN, learning vector quantization
  • Neural networks
    ◦ Multi-layer perceptron
  • Ensemble
    ◦ Random forests
    ◦ Boosting

¹ Max Entropy Classifier

Ref: Algorithms of the Intelligent Web, Marmanis & Babenko

Page 74: Data Science with Spark

[Diagram] Classifiers landscape: regression (continuous variables) vs. categorical variables; decision trees, CART, k-NN (nearest neighbors); bias vs. variance, model complexity, over-fitting; boosting & bagging.

Page 75: Data Science with Spark

Classification - Spark API
o Logistic Regression
o SVMWithSGD
o Decision Trees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o Impurity: “entropy” or “gini”
o maxBins: a control to throttle communication at the expense of accuracy
  • Larger = higher accuracy
  • Smaller = less communication (as the # of bins approaches the number of instances)
o Data-adaptive: the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o Intelligent framework: we need this for scale

Page 76: Data Science with Spark

Lookout for these interesting Spark features
o The concept of LabeledPoint & how to create an RDD of LPs
o Print the tree
o Calculate accuracy & MSE from RDDs

Page 77: Data Science with Spark

Read data & extract features

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 78: Data Science with Spark

Create the model
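The model-creation cell is a screenshot; a hedged sketch with an assumed encoding of the three Titanic predictors:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    # LabeledPoint(survived, [class, sex, age]); class has 3 categories, sex has 2
    data = titanic_rdd.map(lambda r: LabeledPoint(r[0], r[1:]))
    model = DecisionTree.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo={0: 3, 1: 2},
                                         impurity='gini', maxDepth=4, maxBins=100)
    print(model.toDebugString())   # print the tree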

Page 79: Data Science with Spark

Extract labels & features

Page 80: Data Science with Spark

Calculate Accuracy & MSE
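A hedged sketch of computing the accuracy & MSE from RDDs:

    # Zip actual labels with predictions, then count matches / average squared error
    preds = model.predict(data.map(lambda p: p.features))
    labels_preds = data.map(lambda p: p.label).zip(preds)
    accuracy = labels_preds.filter(lambda t: t[0] == t[1]).count() / float(data.count())
    mse = labels_preds.map(lambda t: (t[0] - t[1]) ** 2).mean()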

Page 81: Data Science with Spark

Use NaiveBayes Algorithm
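A hedged sketch of swapping in NaiveBayes on the same LabeledPoint RDD (features must be non-negative; the sample vector is hypothetical):

    from pyspark.mllib.classification import NaiveBayes

    nb_model = NaiveBayes.train(data, lambda_=1.0)   # lambda_ is the smoothing parameter
    print(nb_model.predict([1.0, 0.0, 30.0]))        # hypothetical passenger vector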

Page 82: Data Science with Spark

Decision Tree – Best Practices

maxDepth: tune with data/model selection
maxBins: set low, monitor communications, increase if needed
# RDD partitions: set to the # of cores
  • Usually the recommendation is that the RDD partitions should be over-partitioned, i.e. “more partitions than cores”: because tasks take different times, we need to utilize the compute power, and in the end they average out
  • But for machine learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help
  • Joe Bradley’s talk (reference below) has interesting insights

https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup

DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)

Page 83: Data Science with Spark

Future …
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Left as homework
o In 1.2 …
o Random Forest
  • Bagging
  • PR for Random Forest
o Boosting
o Alpine Lab Sequoia Forest: coordinating merge
o Model Selection Pipeline; design doc
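For the homework, a hedged sketch of the split:

    # 70/30 random train/test split; the seed makes the split reproducible
    train, test = data.randomSplit([0.7, 0.3], seed=42)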

Page 84: Data Science with Spark

Boosting
◦ Goal: model complexity (-), variance (-), prediction accuracy (+)
◦ “Output of weak classifiers into a powerful committee”
◦ Final prediction = weighted majority vote
◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
  § Bagging: independent trees (Spark shines here)
  § Boosting: successively weighted

Page 85: Data Science with Spark

Random Forests+
◦ Goal: model complexity (-), variance (-), prediction accuracy (+)
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves bagging by selecting i.i.d.* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” (ESLII)
◦ Less susceptible to over-fitting (than boosting)
◦ Many RF implementations
  § Original version in Fortran-77! By Breiman/Cutler
  § Python, R, Mahout, Weka, Milk (ML toolkit for py), MATLAB

* i.i.d: independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Page 86: Data Science with Spark

Ensemble Methods
◦ Goal: model complexity (-), variance (-), prediction accuracy (+)
◦ Two steps:
  § Develop a set of learners
  § Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
  § Using different algorithms
  § Using the same algorithm with different settings
  § Assigning different parts of the dataset to different classifiers
◦ Bagging & random forests are examples of ensemble methods

Ref: Machine Learning In Action

Page 87: Data Science with Spark

Random Forests
o While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two variables: the number of predictors (typically √k) & the number of trees (500 for large datasets, 150 for smaller ones)
o Error prediction
  • For each iteration, predict for the data that is not in the sample (OOB data)
  • Aggregate the OOB predictions
  • Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  • Can use this to search for the optimal # of predictors
  • We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction; can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers

Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective, Berk; A Brief Overview of RF, Dan Steinberg

Page 88: Data Science with Spark

Why didn’t RF do better? Bias/Variance
o High bias
  • Due to under-fitting
  • Add more features
  • Use a more sophisticated model: quadratic terms, complex equations, …
  • Decrease regularization
o High variance
  • Due to over-fitting
  • Use fewer features
  • Use more training samples
  • Increase regularization

[Learning-curve diagram: prediction error vs. training error; “need more features or a more complex model to improve” vs. “need more data to improve”]

“Bias is a learner’s tendency to consistently learn the same wrong thing.” – Pedro Domingos

Ref: Strata 2013 tutorial by Olivier Grisel
http://www.slideshare.net/ksankar/data-science-folk-knowledge

Page 89: Data Science with Spark

Break

4:30

Page 90: Data Science with Spark

Clustering

4:45

Page 91: Data Science with Spark

Data Science “folk knowledge” (3 of A)
o More data beats a cleverer algorithm
  • Or conversely, select algorithms that improve with data
  • Don’t optimize prematurely without getting more data
o Learn many models, not just one
  • Ensembles! Change the hypothesis space
  • Netflix prize
  • E.g. bagging, boosting, stacking
o Simplicity does not necessarily imply accuracy
o Representable does not imply learnable
  • Just because a function can be represented does not mean it can be learned
o Correlation does not imply causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning, by Pedro Domingos
  § http://dl.acm.org/citation.cfm?id=2347755

Page 92: Data Science with Spark

Scenario – Clustering with Spark
o InterGallactic Airlines has the GallacticHoppers frequent-flyer program & has data about the customers who participate in the program.
o The airline execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
o So the business wants to customize promotions for their frequent-flyer program.
o Can they just have one type of promotion?
o Should they have different types of incentives?
o Who exactly are the customers in the GallacticHoppers program?
o Recently they have deployed an infrastructure with Spark.
o Can Spark help with this business problem?

Page 93: Data Science with Spark

Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, it still needs human direction & domain knowledge to interpret & guide
o Two types:
  • Centroid-based clustering: k-means clustering
  • Tree-based clustering: hierarchical clustering
o Spark implements the Scalable K-Means++
  • Paper: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

Page 94: Data Science with Spark

Lookout for these interesting Spark features
o Application of the statistics toolbox
o Center & scale an RDD
o Filter RDDs

Page 95: Data Science with Spark

Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = the number of clusters to create, default = 2
o initializationMode = the initialization algorithm. This can be either "random", to choose random points as initial cluster centers, or "k-means||", to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||"
o KMeansModel.predict
  • Maps a point to a cluster
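A hedged sketch tying the API together (the feature RDD is an assumption):

    from pyspark.mllib.clustering import KMeans

    model = KMeans.train(features_rdd, k=5, maxIterations=100,
                         initializationMode='k-means||')
    cluster_ids = features_rdd.map(lambda p: model.predict(p))  # point -> cluster id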

Page 96: Data Science with Spark

Data

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 97: Data Science with Spark

Read Data & Create RDD

Page 98: Data Science with Spark

Train & Predict

Page 99: Data Science with Spark

Calculate error
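The cell is a screenshot; a hedged sketch of the usual within-set sum of squared errors (WSSSE) metric:

    import math

    # Distance of each point to its assigned center, summed over the dataset
    def error(point):
        center = model.clusterCenters[model.predict(point)]
        return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

    wssse = features_rdd.map(error).reduce(lambda a, b: a + b)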

Page 100: Data Science with Spark

But Data is not even

Page 101: Data Science with Spark

So let us center & scale the data and try again
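A hedged sketch of the center-and-scale step with MLlib's StandardScaler (dense feature vectors assumed):

    from pyspark.mllib.feature import StandardScaler

    # withMean centers to zero mean; withStd scales to unit standard deviation
    scaler = StandardScaler(withMean=True, withStd=True).fit(features_rdd)
    scaled_features = scaler.transform(features_rdd)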

Page 102: Data Science with Spark

Looks Good

Let us try with 5 clusters

Page 103: Data Science with Spark

Let us map the cluster to our data

Page 104: Data Science with Spark

Interpretation

[Table: cluster # vs. average feature values vs. interpretation, filled in during the session]

Note:
• This is just a sample interpretation.
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 clusters is more suited to creating targeted promotions.

Page 105: Data Science with Spark

Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (homework: explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters for business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSSE is the minimum.

Page 106: Data Science with Spark

Recommendation Engine

5:05

Page 107: Data Science with Spark

Recommendation & Personalization - Spark

Automated analytics: let the data tell the story; feature learning, AI, deep learning
Learning models: fit parameters as more data arrives
Dynamic models: model selection based on context

o Knowledge-based
o Demographic-based
o Content-based
o Collaborative filtering
  • Item-based
  • User-based
  • Latent-factor-based

Signals: user rating; purchased; looked/not purchased

Spark implements the user-based ALS collaborative filtering.

Ref: ALS: Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu (AT&T Labs, Florham Park, NJ), Y. Koren, C. Volinsky
ALS-WR: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan

Page 108: Data Science with Spark

Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
Page 109: Data Science with Spark

Read & Parse

Page 110: Data Science with Spark

Split & Train

Page 111: Data Science with Spark

Evaluate

Page 112: Data Science with Spark

Epilogue
o We explored interesting APIs in Spark
o ALS collaborative filtering
o RDD operations
  • Join (hash join)
  • In-memory, grace, and recursive hash joins

http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx

Page 113: Data Science with Spark

Questions?

4:45

Page 114: Data Science with Spark

Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark
   http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/

Page 115: Data Science with Spark

Essential Reading List
o A few useful things to know about machine learning, by Pedro Domingos
  • http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms, by David H. Wolpert
  • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, by Y. Benjamini and Y. Hochberg
  • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
  • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, by James Faghmous
  • https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
  • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf

Page 116: Data Science with Spark

For your reading & viewing pleasure … an ordered list
① An Introduction to Statistical Learning
  • http://www-bcf.usc.edu/~gareth/ISL/
② ISL class, Stanford / Hastie / Tibshirani at their best: Statistical Learning
  • http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
  • https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
  • https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
  • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
  • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements of Statistical Learning
  • http://statweb.stanford.edu/~tibs/ElemStatLearn/

http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/

Page 117: Data Science with Spark

References:
o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
  • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013 / Strata 2014, Olivier Grisel
  • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
  • http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
  • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
o Thanks to Ana Crisan for the Titanic inset. Picture courtesy of http://emileeid.com/2012/02/11/titanic-3d-exclusive-posters/

Page 118: Data Science with Spark

The Beginning As The End

How did we do?

4:45

Page 119: Data Science with Spark