Data Science with Spark

TRANSCRIPT

Page 1: Data Science with Spark

In Apache Spark

Foundations of Data Science with Spark

July 16, 2015

@ksankar // doubleclix.wordpress.com

Page 2: Data Science with Spark

www.globalbigdataconference.com

Twitter : @bigdataconf

Page 3: Data Science with Spark

o Intro & Setup [8:00-8:20)
  • Goals/non-goals
o Spark & Data Science DevOps [8:20-8:40)
  • Spark in the context of Data Science
o Where exactly is Apache Spark headed? [8:40-9:30)
  • Spark Yesterday, Today & Tomorrow
  • Spark Stack
o Break [9:30-10:00)
o DataFrames for the Data Scientist [10:00-11:30)
  • pySpark classes
  • Walkthru DataFrames
  • Hands-on Notebooks
o [15] Discussions/Slack (11:30-11:45)

Agenda: Introduction To Spark
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html

Page 4: Data Science with Spark

o Review (2:00-2:30)
  • 004-Orders-Homework-Solution
  • MLlib Statistical Toolbox
  • Summary, Correlations
o [20] Linear Regression (2:30-2:45)
o [20] “Mood Of the Union” (2:45-3:15)
  • State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
  • Map-reduce, parse text
o Break (3:15-3:30)
o [60] Predicting Survivors with Classification (3:30-4:30)
  • Decision Trees
  • Naive Bayes (Titanic data set)
o Break (4:30-4:45)
o [20] Clustering (4:45-5:05)
  • K-means for GallacticHoppers!
o [20] Recommendation Engine (5:05-5:25)
  • Collaborative filtering w/ MovieLens
o [15] Discussions/Slack (5:45-6:00)

Agenda: Data Wrangling w/ DataFrames & MLlib
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html

Page 5: Data Science with Spark

Goals & non-goals

Goals
  ¤ Understand how to program Machine Learning with Spark & Python
  ¤ Focus on programming & ML application
  ¤ Give you focused time to work thru the examples
    § Work with me. I will wait if you want to catch up.
  ¤ Less theory, more usage: let us see if this works
  ¤ As straightforward as possible
    § The programs can be optimized

Non-goals
  ¡ Go deep into the algorithms
    • We don’t have sufficient time. The topic could easily be a 5-day tutorial!
  ¡ Dive into Spark internals
    • That is for another day
  ¡ The underlying computation, communication, constraints & distribution is a fascinating subject
    • Paco does a good job explaining them
  ¡ A passive talk
    • Nope. Interactive & hands-on.

Page 6: Data Science with Spark

About Me
o Data Scientist
  • Decision Data Science & Product Data Science
  • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
  • Predicting NFL with Elo like Nate Silver & 538 [NFL: http://goo.gl/Q2OgeJ, NBA ’15: https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Full-day Spark workshop “Advanced Data Science w/ Spark” at Spark Summit East ’15 [https://goo.gl/7SBKTC]
o Co-author: “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer: “Machine Learning with Spark”, Packt Publishing
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech), starting MS-CFRM, University of WA
  • Written books (Web 2.0, Wireless, Java, …), standards, some work in AI
  • Guest lecturer at Naval Postgraduate School, …
o Volunteer as a Robotics Judge at FIRST Lego League World Competitions
o @ksankar, doubleclix.wordpress.com

Page 7: Data Science with Spark

Close Encounters
◦ 1st: This Tutorial
◦ 2nd: Do more hands-on walkthroughs
◦ 3rd: Listen to lectures, more competitions …

Page 8: Data Science with Spark

Spark Installation
o Install Spark 1.4.1 on your local machine
  • https://spark.apache.org/downloads.html
  • Pre-built for Hadoop 2.6 is fine
  • Download & uncompress
  • Remember the path & use it wherever you see /usr/local/spark/
  • I have downloaded it into /usr/local & have a softlink spark to the latest version
o Install IPython

Page 9: Data Science with Spark

Tutorial Materials
o Github: https://github.com/xsankar/global-bd-conf
  • Clone or download the zip
o Open a terminal
o cd ~/global-bd-conf
o IPYTHON=1 IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.0.3
o Notes:
  • I have a soft link “spark” in my /usr/local that points to the Spark version that I use. For example: ln -s spark-1.4.1/ spark
o Click on the IPython dashboard
o Run 000-PreFlightCheck.ipynb
o Run 001-TestSparkCSV.ipynb
o Now you are ready for the workshop!

Page 10: Data Science with Spark

Spark & Data Science DevOps

8:20

Page 11: Data Science with Spark

Spark in the context of data science

Page 12: Data Science with Spark

Data Science: The art of building a model with known knowns, which, when let loose, works with unknown unknowns!

Donald Rumsfeld is an armchair Data Scientist!

http://smartorg.com/2013/07/valuepoint19/

[2x2 diagram: The World (Knowns/Unknowns) vs. You (Known/Unknown)]

o Known Knowns
  • There are things we know that we know
  • What we do
o Known Unknowns
  • That is to say, there are things that we now know we don't know
  • Potential facts and outcomes we are aware of, but not with certainty
  • Stochastic processes, probabilities
o Unknown Knowns
  • Others know, you don’t
o Unknown Unknowns
  • There are things we do not know we don't know
  • Facts, outcomes or scenarios we have not encountered, nor considered
  • “Black swans”, outliers, long tails of probability distributions
  • Lack of experience, imagination

Page 13: Data Science with Spark

The curious case of the Data Scientist
o Data Scientist is multi-faceted & contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story

http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/

“Large is hard; Infinite is much easier!” – Titus Brown

Page 14: Data Science with Spark

Data Science - Context

Data Management pipeline: Collect → Store → Transform

Collect
  o Volume
  o Velocity
  o Streaming data

Store
  o Canonical form
  o Data catalog
  o Data fabric across the organization
  o Access to multiple sources of data
  o Think hybrid: Big Data apps, appliances & infrastructure

Transform
  o Metadata
  o Monitor counters & metrics
  o Structured vs. multi-structured

Data Science pipeline: Reason → Model → Deploy

Reason
  o Flexible & selectable
    § Data subsets
    § Attribute sets

Model
  o Refine the model with
    § Extended data subsets
    § Engineered attribute sets
  o Validation run across a larger data set

Deploy
  o Scalable model deployment
  o Big Data automation & purpose-built appliances (soft/hard)
  o Manage SLAs & response times

Explore / Visualize / Recommend / Predict

Explore
  o Dynamic data sets
  o 2-way key-value tagging of datasets
  o Extended attribute sets
  o Advanced analytics

Visualize
  o Advanced visualization
  o Interactive dashboards
  o Map overlay
  o Infographics

Recommend / Predict
  o Performance
  o Scalability
  o Refresh latency
  o In-memory analytics

¤ Bytes to Business a.k.a. build the full stack
¤ Find relevant data for the business
¤ Connect the dots

Page 15: Data Science with Spark

Volume • Velocity • Variety
Context • Connectedness
Intelligence • Interface • Inference

“Data of unusual size” that can't be brute-forced

o The Three Amigos
  • Interface = Cognition
  • Intelligence = Compute (CPU) & Computational (GPU)
  • Infer significance & causality

Page 16: Data Science with Spark

Day in the life of a (super) Model

[Diagram: Intelligence, Inference, Data Representation & Interface surrounding the model lifecycle: Algorithms, Parameters, Attributes; Data (Scoring); Feature Selection & Dimensionality Reduction; Model Selection; Reason & Learn; Models; Model Assessment; Visualize, Recommend, Explore]

Page 17: Data Science with Spark

Data Science Maturity Model & Spark

Isolated Analytics
  • Data: small data
  • Context: local
  • Model, Reason & Deploy: single set of boxes, usually owned by the model builders; departmental
  • Interface: reports
  • Type: descriptive & reactive
  • Datasets: all in the same box
  • Workload: skunk works
  • Strategy: informal definitions

Integrated Analytics
  • Data: larger data sets
  • Context: domain
  • Model, Reason & Deploy: deploy on central analytics infrastructure; models still owned & operated by modelers; partly enterprise-wide
  • Interface: dashboards
  • Type: + predictive
  • Datasets: fixed data sets, usually in temp data spaces
  • Workload: business-relevant apps with approximate SLAs
  • Strategy: data definitions buried in the analytics models

Aggregated Analytics
  • Data: Big Data
  • Context: cross-domain + external
  • Model, Reason & Deploy: central analytics infrastructure; model & reason by model builders; deploy & operate by ops; residuals and other metrics monitored by modelers; enterprise-wide
  • Interface: dashboards + some APIs
  • Type: + adaptive
  • Datasets: flexible data & attribute sets
  • Workload: high-performance appliance clusters
  • Strategy: some data definitions

Automated Analytics
  • Data: Big Data factory model
  • Context: cross-domain + external
  • Model, Reason & Deploy: distributed analytics infrastructure; AI-augmented models; model & reason by model builders; deploy & operate by ops; data as a monetized service, extending to ecosystem partners
  • Interface: dashboards + well-defined APIs + programming models
  • Type: adaptive
  • Datasets: dynamic datasets with well-defined refresh policies
  • Workload: appliances and clusters for multiple workloads, including real-time apps; infrastructure for emerging technologies
  • Strategy: data catalogue, metadata & annotations; Big Data MDM strategy

Page 18: Data Science with Spark

The Sense & Sensibility of a Data Scientist DevOps

Factory = Operational
Lab = Investigative

http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

Page 19: Data Science with Spark

Where exactly is Apache Spark headed?

Spark Yesterday, Today & Tomorrow …

“Unified engine across diverse data sources, workloads & environments”

8:40

Page 20: Data Science with Spark

http://free-stock-illustration.com/winding+road+clip+art

Spark Yesterday, Today & Tomorrow …

Spark 1.x
• Fast engine for big data processing
• Fast to run code & fast to write code
• In-memory computation graphs, compatibility with the Hadoop ecosystem and interesting, very usable APIs
• Iterative & interactive apps that operate on data multiple times, which are not a good use case for Hadoop

Spark 1.3 & beyond has been the catalyst for a renaissance in Data Science!

Spark 1.4+
• Multi-pass analytics: ML pipelines, GraphX
• Ad-hoc queries: DataFrames
• Real-time stream processing: Spark Streaming
• Parallel machine learning algorithms beyond the basic RDDs
• More types of data sources as input & output
• More integration with R to span statistical computing beyond “single-node tools”
• More integration with apps like visualization dashboards
• More performance with even larger datasets & complex applications: Project Tungsten

Page 21: Data Science with Spark

Spark Directions

Data Science
  o DataFrames
  o ML Pipelines
  o SparkR

Platform APIs
  o Growing the ecosystem
    § Data sources: uniform access to diverse data sources
    § Pluggable “smart” DataSource API for reading/writing DataFrames while minimizing I/O
    § Spark Packages
    § Deployment utilities for Google Compute, Azure & Job Server

Execution Optimization (Project Tungsten)
  o Focus on CPU efficiency
  o Run-time code generation
  o Cache locality & cache-aware data structures
  o Binary format for aggregations
  o Spark-managed memory
  o Off-heap memory management

Streaming, DAG Visualization & Debugging
  o Spark Streaming flow control & optimized state management

Page 22: Data Science with Spark

Spark: The (simple) Stack

Page 23: Data Science with Spark

RDD – The workhorse of Core Spark

o Resilient Distributed Datasets
  • A collection that can be operated on in parallel
o Transformations create RDDs
  • map, filter, …
o Actions get values
  • collect, take, …
o We will apply these operations during this tutorial
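The slide's examples are screenshots that did not survive the transcript; here is a minimal pySpark sketch of the transformation/action split, assuming the sc SparkContext that the pyspark shell provides (the numbers are illustrative only):

    # Transformations are lazy: nothing executes until an action is called
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    squares = rdd.map(lambda x: x * x)              # transformation -> new RDD
    evens = squares.filter(lambda x: x % 2 == 0)    # transformation -> new RDD
    print(evens.collect())                          # action -> [4, 16]
    print(squares.take(2))                          # action -> [1, 4]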

Page 24: Data Science with Spark

[Stack diagram]
• Language bindings: R, Scala, Java, Python
• DataFrame API
• ML Pipelines & Advanced Analytics: Neural Networks, Deep Learning, Parameter Server
• Spark SQL, Spark Streaming, SparkR, MLlib, GraphX, Packages
• Catalyst Optimizer: optimizes the execution plan
• Tungsten Execution
• Spark Core / RDD
• Data Sources: Parquet, Hadoop, Cassandra, JSON, CSV, JDBC, …

Page 25: Data Science with Spark

Query Optimization-Execution pipeline

SQL Query / DataFrame
  → Unresolved Logical Plan
  → [Analysis, using the Catalog]
  → Logical Plan
  → [Logical Optimization]
  → Optimized Logical Plan
  → [Physical Planning]
  → Physical Plans
  → [Cost Model selects one]
  → Selected Physical Plan
  → [Code Generation]
  → RDDs

Ref: Spark SQL paper

Page 26: Data Science with Spark

Spark DataFrames for the Data Scientist

“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”

- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.

DataFrames! The most massively useful thing a Data Scientist can have …

10:00

Page 27: Data Science with Spark

Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s generalization that counts
  • The fundamental goal of machine learning is to generalize beyond the examples in the training set
o Data alone is not enough
  • Induction, not deduction: every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine learning is not magic; one cannot get something from nothing
  • In order to infer, one needs the knobs & the dials
  • One also needs a rich, expressive dataset

A few useful things to know about machine learning, by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755

Page 28: Data Science with Spark

pyspark
  pyspark.SparkContext()
  pyspark.SparkConf()
  pyspark.RDD()
  pyspark.Broadcast()
  pyspark.Accumulator()
  pyspark.SparkFiles()
  pyspark.StorageLevel()

Submodules: pyspark.sql, pyspark.streaming, pyspark.mllib, pyspark.ml

pyspark.streaming
  pyspark.streaming.StreamingContext()
  pyspark.streaming.DStream()
  pyspark.streaming.kafka
  pyspark.streaming.kafka.Broker()
  …

pyspark.sql
  pyspark.sql.SQLContext()
  pyspark.sql.DataFrame()
  pyspark.sql.DataFrameNaFunctions()
  pyspark.sql.DataFrameStatFunctions()
  pyspark.sql.DataFrameReader()
  pyspark.sql.DataFrameWriter()
  pyspark.sql.Column()
  pyspark.sql.Row()
  pyspark.sql.functions
  pyspark.sql.types
  pyspark.sql.Window()
  pyspark.sql.WindowSpec()
  pyspark.sql.GroupedData()
  pyspark.sql.HiveContext()

pyspark.mllib
  pyspark.mllib.classification
  pyspark.mllib.clustering
  pyspark.mllib.evaluation
  pyspark.mllib.feature
  pyspark.mllib.fpm
  pyspark.mllib.linalg
  pyspark.mllib.random
  pyspark.mllib.recommendation
  pyspark.mllib.regression
  pyspark.mllib.stat
  pyspark.mllib.tree
  pyspark.mllib.util

pyspark.ml (ML Pipeline APIs)
  pyspark.ml.Transformer
  pyspark.ml.Estimator
  pyspark.ml.Model
  pyspark.ml.Pipeline
  pyspark.ml.PipelineModel
  pyspark.ml.param
  pyspark.ml.feature
  pyspark.ml.classification
  pyspark.ml.recommendation
  pyspark.ml.regression
  pyspark.ml.tuning
  pyspark.ml.evaluation

Page 29: Data Science with Spark

1. SparkContext()
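The slide itself is a screenshot; a minimal sketch of creating a SparkContext in a standalone script (in the pyspark shell, sc is created for you; the app name and master here are assumptions):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("global-bd-conf").setMaster("local[*]")
    sc = SparkContext(conf=conf)   # entry point to Spark; one per process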

Page 30: Data Science with Spark

2. Read/Write
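Again the screenshot is lost; a hedged sketch of reading and writing DataFrames with the spark-csv package loaded at startup (the file names are made up):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    # Read a CSV file into a DataFrame via the com.databricks.spark.csv source
    df = sqlContext.read.format('com.databricks.spark.csv') \
                        .options(header='true') \
                        .load('cars.csv')

    # Write the DataFrame back out, e.g. as Parquet
    df.write.parquet('cars.parquet')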

Page 31: Data Science with Spark

3. Convert

Between pyspark.sql.DataFrame, a SQL table and a pandas.DataFrame:

  sqlContext.registerDataFrameAsTable(df, "aTable")
  df2 = sqlContext.table("aTable")

  df = sqlContext.createDataFrame(a_pandas_df)
  p_df = df.toPandas()

Page 32: Data Science with Spark

4. Columns & Rows (1 of 3)
o Select a column
  • by the df("<columnName>") notation or the df.<columnName> notation
  • The recommended way is df("<columnName>"), the reason being that a column name can collide with a DataFrame method if we use df.<columnName>
o Column-wise operations like +, -, *, /, % (modulo), &&, ||, <, <=, > and >=
  • df("total") = df("price") * df("qty")
  • The inequality operator is !==, the usual equality operator is ===, and <=> is an equality test that is safe for null values
o Meta operations: type conversion (cast), alias, not null, …
  • df_cars.mpg.cast("double").alias('mpg')
o Run arbitrary UDFs on a column (see next page)

Page 33: Data Science with Spark

4. Columns & Rows (2 of 3)

Run arbitrary UDFs on a column
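The walkthrough cell is a screenshot; a hedged sketch of a column UDF in the 1.4-era API (the mpg-to-km/l conversion and the df_cars column names are invented for illustration):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Wrap a plain Python function as a column-level UDF with an explicit return type
    mpg_to_kmpl = udf(lambda mpg: mpg * 0.425144, DoubleType())
    df_cars = df_cars.withColumn('kmpl', mpg_to_kmpl(df_cars['mpg']))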

Page 34: Data Science with Spark

4. Columns & Rows (3 of 3)

Interesting Operations … Adding a column …
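For the "adding a column" operation shown in the screenshot, a plain column-arithmetic sketch (the column names are assumptions):

    # withColumn returns a new DataFrame with the derived column appended
    df_orders = df_orders.withColumn('total', df_orders['UnitPrice'] * df_orders['Quantity'])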

Page 35: Data Science with Spark

5. DataFrame: RDD-like Operations

Function: df.sort(<sort expression>) or df.orderBy(<sort expression>)
Description: Returns a sorted DataFrame. There are multiple ways of specifying the sort expression. Use of orderBy is recommended (as the syntax is closer to SQL), for example:
  df_orders_1.groupBy("CustomerID", "Year").count().orderBy('count', ascending=False).show()

Function: df.filter(<condition>) or df.where(<condition>)
Description: Returns a new DataFrame after applying the <condition>. The condition is usually based on a column. Use of the where form is recommended (as the syntax is closer to the SQL world), for example:
  df_orders.where(df_orders['ShipCountry'] == 'France')

Function: df.coalesce(n)
Description: Returns a DataFrame with n partitions, same as the coalesce(n) method of RDD.

Function: df.foreach(<function>)
Description: Applies a function to all the rows of a DataFrame.

Function: df.map(lambda r: …)
Description: Applies the function to all the rows and returns the resulting objects.

Function: df.flatMap(lambda r: …)
Description: Returns an RDD, flattened, after applying the function to all the rows of the DataFrame.

Function: df.rdd
Description: Returns the DataFrame as an RDD of Row objects.

Function: df.na.replace([<values to be replaced>], [<replacement values>], subset=[<columns>]), also DataFrame.replace() or DataFrameNaFunctions.replace()
Description: An interesting function, very useful and a little strange syntax-wise. The recommended form is df.na.replace(), even though the .na namespace throws it a little bit. Use subset= for the column names. The syntax is different from the Scala syntax.

Page 36: Data Science with Spark

6. DataFrame: Actions
o cache()
o collect(), collectAsList()
o count()
o describe()
o first(), head(), show(), take()
o …
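A short sketch of these actions in use (a hedged example, not the deck's notebook):

    df.cache()            # mark the DataFrame for in-memory reuse
    print(df.count())     # number of rows
    df.describe().show()  # summary statistics for the numeric columns
    df.show(5)            # print the first 5 rows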

Page 37: Data Science with Spark

7. DataFrame: Scientific Functions

Page 38: Data Science with Spark

8. DataFrame: Statistical Functions

The pair-wise frequency (contingency table) of transmission type vs. number of speeds shows interesting observations:
• All automatic cars in the dataset are 3-speed, while most of the manual-transmission cars have 4 or 5 speeds
• Almost all the manual cars have 2 barrels, while the automatic cars have 2 and 4 barrels
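The table itself is a screenshot; a hedged sketch of how such a contingency table is produced with the 1.4 stat functions (mtcars-style column names are assumptions):

    # Pair-wise frequency of transmission type vs. number of forward gears
    df_cars.stat.crosstab('am', 'gear').show()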

Page 39: Data Science with Spark

9. DataFrame: Aggregate Functions
o The pyspark.sql.functions module (and org.apache.spark.sql.functions for Scala) contains the aggregation functions
o There are two types of aggregations: one over column values, and the other over subsets of column values, i.e. values grouped by some other columns
  • pyspark.sql.functions.avg("sales")
  • df.groupBy("year").agg({"sales": "avg"})
o count(), countDistinct()
o first(), last()
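A hedged sketch of both aggregation styles (the 'sales', 'year' and 'CustomerID' columns are assumptions):

    from pyspark.sql import functions as F

    # Aggregate over a whole column
    df.agg(F.avg('sales')).show()

    # Aggregate over grouped values
    df.groupBy('year').agg({'sales': 'avg'}).show()
    df.groupBy('year').agg(F.avg('sales'), F.countDistinct('CustomerID')).show()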

Page 40: Data Science with Spark

10. DataFrame: na
o One of the tenets of big data and data science is that data is never fully clean. While we can handle types, formats et al, missing values are always challenging.
o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values.
o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success; it is better than dropping the rows.
  • We can replace null with 0
  • A better solution is to replace numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy
  • We could use the mode or median instead of the mean
  • Another good strategy is to infer the missing value from other attributes, i.e. “evidence from multiple fields”
    • For example, the Titanic data has a name field; for imputing the missing age field, we could use the Mr., Master., Mrs., Miss designation from the name and then fill in the average of the age field for the corresponding designation. So a row with a missing age and “Master.” in the name would get the average age of all records with “Master.”
    • There are also fields for the number of siblings and the number of spouses. We could average the age based on the values of those fields.
    • We could even average the ages from the different strategies.

Page 41: Data Science with Spark

10. DataFrame: na
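The screenshot is gone; a hedged sketch of the DataFrameNaFunctions namespace on Titanic-style columns (the column names and fill values are assumptions):

    df_nodrop = df.na.drop()                # drop any row containing a null
    df_filled = df.na.fill({'age': 29.7})   # impute a numeric column, e.g. with its mean
    df_recoded = df.na.replace(['?'], ['unknown'], subset=['embarked'])  # recode sentinels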

Page 42: Data Science with Spark

11. Joins/Set Operations a.k.a. Language-Integrated Queries
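A hedged sketch of the join and set operations (the DataFrame names are assumptions; unionAll is the 1.4-era spelling):

    joined = df_orders.join(df_customers,
                            df_orders['CustomerID'] == df_customers['CustomerID'],
                            'inner')
    common = df_a.intersect(df_b)    # rows present in both
    combined = df_a.unionAll(df_b)   # rows in either (keeps duplicates)
    only_a = df_a.subtract(df_b)     # rows only in df_a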

Page 43: Data Science with Spark

12. SQL on Tables
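A hedged sketch of running SQL against a registered DataFrame (the table and column names are assumptions):

    sqlContext.registerDataFrameAsTable(df_orders, 'orders')
    top = sqlContext.sql(
        "SELECT CustomerID, COUNT(*) AS n FROM orders "
        "GROUP BY CustomerID ORDER BY n DESC")
    top.show(10)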

Page 44: Data Science with Spark

Hands-On
o 003-DataFrame-For-DS
  • Understand and run the iPython notebook
o 004-Orders
  • Homework; we will go through the solution when we meet in the afternoon

Page 45: Data Science with Spark

Data Wrangling with Spark

2:00

Page 46: Data Science with Spark

Algorithm spectrum: Machine Learning → Cute Math → Artificial Intelligence

Machine Learning
  o Regression
  o Logit
  o CART
  o Ensemble: Random Forest
  o Clustering
  o kNN
  o Genetic Algorithms
  o Simulated Annealing
  o Collaborative Filtering

Cute Math
  o SVM
  o Kernels
  o SVD

Artificial Intelligence
  o Neural Nets
  o Boltzmann Machine
  o Feature Learning

Page 47: Data Science with Spark

Statistical Toolbox
o Sample data: car mileage data
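The sample output is a screenshot; a hedged sketch of the MLlib statistical toolbox on the car-mileage data (the RDD names are assumptions):

    from pyspark.mllib.stat import Statistics

    summary = Statistics.colStats(car_vectors)   # RDD of per-car feature vectors
    print(summary.mean(), summary.variance())

    # Pearson correlation between two numeric columns, e.g. mpg vs. horsepower
    print(Statistics.corr(mpg_rdd, hp_rdd, method='pearson'))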

Page 48: Data Science with Spark
Page 49: Data Science with Spark

Linear Regression

2:30

Page 50: Data Science with Spark

Linear Regression - API

LabeledPoint: the features and label of a data point
LinearModel: weights, intercept
LinearRegressionModelBase: predict()
LinearRegressionModel
LinearRegressionWithSGD:
  train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
LassoModel: least-squares fit with an l_1 penalty term
LassoWithSGD:
  train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
RidgeRegressionModel: least-squares fit with an l_2 penalty term
RidgeRegressionWithSGD:
  train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)

Page 51: Data Science with Spark

Basic Linear Regression
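The notebook cell is a screenshot; a hedged sketch of the MLlib flow (the feature layout, label = mpg and feature = horsepower, is an assumption):

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    data = car_rdd.map(lambda row: LabeledPoint(row[0], [row[1]]))
    model = LinearRegressionWithSGD.train(data, iterations=100, step=0.0001)
    print(model.weights, model.intercept)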

Page 52: Data Science with Spark

Use LR model for prediction & calculate MSE
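A hedged sketch of the prediction/MSE step over the same LabeledPoint RDD:

    # Pair each actual label with the model's prediction, then average the squared error
    preds = data.map(lambda p: (p.label, model.predict(p.features)))
    mse = preds.map(lambda t: (t[0] - t[1]) ** 2).mean()
    print("MSE = %.4f" % mse)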

Page 53: Data Science with Spark

Step size is important; the model can diverge!

Page 54: Data Science with Spark

Interesting step size

Page 55: Data Science with Spark
Page 56: Data Science with Spark
Page 57: Data Science with Spark

“Mood Of the Union” with TF-IDF

2:45

Page 58: Data Science with Spark

Scenario – Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country.
o If so, can we infer the mood of the country by analyzing the SOTU?
o If we embark on this line of thought, how would we do it with Spark & Python?
o Is it different from Hadoop MapReduce?
o Is it better?

Page 59: Data Science with Spark

POA (Plan Of Action)
o Collect the State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTUs from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word-frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the times
o Compute set differences and see how new words have cropped up
o Compute TF-IDF (homework!)
Page 60: Data Science with Spark

Lookout for these interesting Spark features
o RDD map-reduce
o How to parse input
o Removing common words
o Sorting an RDD by value

Page 61: Data Science with Spark

Read & Create word vector

iPython notebook at https://github.com/xsankar/cloaked-ironman
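The notebook cell is a screenshot; a hedged sketch of the read / word-count / sort-by-value steps (the file name is an assumption):

    import re

    lines = sc.textFile('sotu/2014-obama.txt')
    words = lines.flatMap(lambda line: re.split(r'\W+', line.lower())) \
                 .filter(lambda w: len(w) > 0)
    # Classic map-reduce word count, then sort the (word, count) RDD by value
    freq = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    top25 = freq.sortBy(lambda kv: kv[1], ascending=False).take(25)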

Page 62: Data Science with Spark

Remove Common Words – 1 of 3

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 63: Data Science with Spark

Remove Common Words – 2 of 3

Page 64: Data Science with Spark

Remove Common Words – 3 of 3
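The three screenshot cells boil down to a set difference against a stop-word list; a hedged sketch (the stop-word list here is abbreviated):

    # Key the stop words, then subtractByKey to drop them from the frequency RDD
    stop = sc.parallelize(['the', 'of', 'and', 'to', 'in', 'a']).map(lambda w: (w, 1))
    content_words = freq.subtractByKey(stop)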

Page 65: Data Science with Spark

FDR vs. Barack Obama as reflected by SOTU

Page 66: Data Science with Spark

Barack Obama vs. Bill Clinton

Page 67: Data Science with Spark

GWB vs. Abe Lincoln as reflected by SOTU

Page 68: Data Science with Spark

Epilogue
o Interesting exercise
o Highlights
  • Map-reduce in a couple of lines!
  • But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen¹)
  • Set differences using subtractByKey
  • Ability to sort a map by values (or any arbitrary function, for that matter)
o To explore as homework:
  • TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

¹ http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

Page 69: Data Science with Spark

Break

3:15

Page 70: Data Science with Spark

Predicting Survivors with Classification

3:30

Page 71: Data Science with Spark

Data Science “folk knowledge” (Wisdom of Kaggle): Jeremy’s Axioms
o Iteratively explore data
o Tools
  • Excel format, Perl, Perl book, Spark!
o Get your head around the data
  • Pivot tables
o Don’t over-complicate
o If people give you data, don’t assume that you need to use all of it
o Look at pictures!
o History of your submissions: keep a tab
o Don’t be afraid to submit simple solutions
  • We will do this during this workshop

Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/

Page 72: Data Science with Spark

Classification - Scenario

Titanic Passenger Metadata
• Small
• 3 predictors: class, sex, age
• Label: survived?

o This is a knowledge exercise
o Classify survival from the Titanic data
o Gives us a quick dataset to run & test classification

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 73: Data Science with Spark

Classifying Classifiers

Statistical
  • Regression
  • Logistic Regression¹
  • Naïve Bayes
  • Bayesian Networks
  • SVM

Structural
  • Rule-based
    ◦ Production rules
    ◦ Decision trees
  • Distance-based
    ◦ Functional: linear, spectral, wavelet
    ◦ Nearest neighbor: kNN, learning vector quantization
  • Neural networks
    ◦ Multi-layer perceptron
  • Ensemble
    ◦ Random forests
    ◦ Boosting

¹ Max Entropy Classifier

Ref: Algorithms of the Intelligent Web, Marmanis & Babenko

Page 74: Data Science with Spark

[Diagram] Classifiers landscape: regression (continuous variables) vs. categorical variables; decision trees, CART, k-NN (nearest neighbors); bias vs. variance, model complexity, over-fitting; boosting & bagging.

Page 75: Data Science with Spark

Classification - Spark API
o Logistic Regression
o SVMWithSGD
o Decision Trees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o Impurity: “entropy” or “gini”
o maxBins: a control to throttle communication at the expense of accuracy
  • Larger = higher accuracy
  • Smaller = less communication (as the # of bins approaches the number of instances)
o Data-adaptive: the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o Intelligent framework: we need this for scale

Page 76: Data Science with Spark

Lookout for these interesting Spark features
o The concept of LabeledPoint & how to create an RDD of LPs
o Print the tree
o Calculate accuracy & MSE from RDDs

Page 77: Data Science with Spark

Read data & extract features

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 78: Data Science with Spark

Create the model
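The model-creation cell is a screenshot; a hedged sketch with an assumed encoding of the three Titanic predictors:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    # LabeledPoint(survived, [class, sex, age]); class has 3 categories, sex has 2
    data = titanic_rdd.map(lambda r: LabeledPoint(r[0], r[1:]))
    model = DecisionTree.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo={0: 3, 1: 2},
                                         impurity='gini', maxDepth=4, maxBins=100)
    print(model.toDebugString())   # print the tree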

Page 79: Data Science with Spark

Extract labels & features

Page 80: Data Science with Spark

Calculate Accuracy & MSE
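A hedged sketch of computing the accuracy & MSE from RDDs:

    # Zip actual labels with predictions, then count matches / average squared error
    preds = model.predict(data.map(lambda p: p.features))
    labels_preds = data.map(lambda p: p.label).zip(preds)
    accuracy = labels_preds.filter(lambda t: t[0] == t[1]).count() / float(data.count())
    mse = labels_preds.map(lambda t: (t[0] - t[1]) ** 2).mean()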

Page 81: Data Science with Spark

Use NaiveBayes Algorithm
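A hedged sketch of swapping in NaiveBayes on the same LabeledPoint RDD (features must be non-negative; the sample vector is hypothetical):

    from pyspark.mllib.classification import NaiveBayes

    nb_model = NaiveBayes.train(data, lambda_=1.0)   # lambda_ is the smoothing parameter
    print(nb_model.predict([1.0, 0.0, 30.0]))        # hypothetical passenger vector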

Page 82: Data Science with Spark

Decision Tree – Best Practices

maxDepth: tune with data/model selection
maxBins: set low, monitor communications, increase if needed
# RDD partitions: set to the # of cores
  • Usually the recommendation is that the RDD partitions should be over-partitioned, i.e. “more partitions than cores”: because tasks take different times, we need to utilize the compute power, and in the end they average out
  • But for machine learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help
  • Joe Bradley’s talk (reference below) has interesting insights

https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup

DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)

Page 83: Data Science with Spark

Future …
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Left as homework
o In 1.2 …
o Random Forest
  • Bagging
  • PR for Random Forest
o Boosting
o Alpine Lab Sequoia Forest: coordinating merge
o Model Selection Pipeline; design doc
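For the homework, a hedged sketch of the split:

    # 70/30 random train/test split; the seed makes the split reproducible
    train, test = data.randomSplit([0.7, 0.3], seed=42)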

Page 84: Data Science with Spark

Boosting
◦ Goal: model complexity (-), variance (-), prediction accuracy (+)
◦ “Output of weak classifiers into a powerful committee”
◦ Final prediction = weighted majority vote
◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
  § Bagging: independent trees (Spark shines here)
  § Boosting: successively weighted

Page 85: Data Science with Spark

Random Forests+
◦ Goal: model complexity (-), variance (-), prediction accuracy (+)
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves bagging by selecting i.i.d.* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” (ESLII)
◦ Less susceptible to over-fitting (than boosting)
◦ Many RF implementations
  § Original version in Fortran-77! By Breiman/Cutler
  § Python, R, Mahout, Weka, Milk (ML toolkit for py), MATLAB

* i.i.d: independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Page 86: Data Science with Spark

Ensemble Methods
◦ Goal: model complexity (-), variance (-), prediction accuracy (+)
◦ Two steps:
  § Develop a set of learners
  § Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
  § Using different algorithms
  § Using the same algorithm with different settings
  § Assigning different parts of the dataset to different classifiers
◦ Bagging & random forests are examples of ensemble methods

Ref: Machine Learning In Action

Page 87: Data Science with Spark

Random Forests
o While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two variables: the number of predictors (typically √k) & the number of trees (500 for large datasets, 150 for smaller ones)
o Error prediction
  • For each iteration, predict for the data that is not in the sample (OOB data)
  • Aggregate the OOB predictions
  • Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  • Can use this to search for the optimal # of predictors
  • We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction; can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers

Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective, Berk; A Brief Overview of RF, Dan Steinberg

Page 88: Data Science with Spark

Why didn’t RF do better? Bias/Variance
o High bias
  • Due to under-fitting
  • Add more features
  • Use a more sophisticated model: quadratic terms, complex equations, …
  • Decrease regularization
o High variance
  • Due to over-fitting
  • Use fewer features
  • Use more training samples
  • Increase regularization

[Learning-curve diagram: prediction error vs. training error; “need more features or a more complex model to improve” vs. “need more data to improve”]

“Bias is a learner’s tendency to consistently learn the same wrong thing.” – Pedro Domingos

Ref: Strata 2013 tutorial by Olivier Grisel
http://www.slideshare.net/ksankar/data-science-folk-knowledge

Page 89: Data Science with Spark

Break

4:30

Page 90: Data Science with Spark

Clustering

4:45

Page 91: Data Science with Spark

Data Science “folk knowledge” (3 of A)
o More data beats a cleverer algorithm
  • Or conversely, select algorithms that improve with data
  • Don’t optimize prematurely without getting more data
o Learn many models, not just one
  • Ensembles! Change the hypothesis space
  • Netflix prize
  • E.g. bagging, boosting, stacking
o Simplicity does not necessarily imply accuracy
o Representable does not imply learnable
  • Just because a function can be represented does not mean it can be learned
o Correlation does not imply causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning, by Pedro Domingos
  § http://dl.acm.org/citation.cfm?id=2347755

Page 92: Data Science with Spark

Scenario – Clustering with Spark
o InterGallactic Airlines has the GallacticHoppers frequent-flyer program & has data about the customers who participate in the program.
o The airline execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
o So the business wants to customize promotions for their frequent-flyer program.
o Can they just have one type of promotion?
o Should they have different types of incentives?
o Who exactly are the customers in the GallacticHoppers program?
o Recently they have deployed an infrastructure with Spark.
o Can Spark help with this business problem?

Page 93: Data Science with Spark

Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, it still needs human direction & domain knowledge to interpret & guide
o Two types:
  • Centroid-based clustering: k-means clustering
  • Tree-based clustering: hierarchical clustering
o Spark implements the Scalable K-Means++
  • Paper: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

Page 94: Data Science with Spark

Lookout for these interesting Spark features
o Application of the statistics toolbox
o Center & scale an RDD
o Filter RDDs

Page 95: Data Science with Spark

Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = the number of clusters to create, default = 2
o initializationMode = the initialization algorithm. This can be either "random", to choose random points as initial cluster centers, or "k-means||", to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||"
o KMeansModel.predict
  • Maps a point to a cluster
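A hedged sketch tying the API together (the feature RDD is an assumption):

    from pyspark.mllib.clustering import KMeans

    model = KMeans.train(features_rdd, k=5, maxIterations=100,
                         initializationMode='k-means||')
    cluster_ids = features_rdd.map(lambda p: model.predict(p))  # point -> cluster id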

Page 96: Data Science with Spark

Data

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 97: Data Science with Spark

Read Data & Create RDD

Page 98: Data Science with Spark

Train & Predict

Page 99: Data Science with Spark

Calculate error
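The cell is a screenshot; a hedged sketch of the usual within-set sum of squared errors (WSSSE) metric:

    import math

    # Distance of each point to its assigned center, summed over the dataset
    def error(point):
        center = model.clusterCenters[model.predict(point)]
        return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

    wssse = features_rdd.map(error).reduce(lambda a, b: a + b)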

Page 100: Data Science with Spark

But Data is not even

Page 101: Data Science with Spark

So let us center & scale the data and try again
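A hedged sketch of the center-and-scale step with MLlib's StandardScaler (dense feature vectors assumed):

    from pyspark.mllib.feature import StandardScaler

    # withMean centers to zero mean; withStd scales to unit standard deviation
    scaler = StandardScaler(withMean=True, withStd=True).fit(features_rdd)
    scaled_features = scaler.transform(features_rdd)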

Page 102: Data Science with Spark

Looks Good

Let us try with 5 clusters

Page 103: Data Science with Spark

Let us map the cluster to our data

Page 104: Data Science with Spark

Interpretation

[Table: cluster # vs. average feature values vs. interpretation, filled in during the session]

Note:
• This is just a sample interpretation.
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 clusters is more suited to creating targeted promotions.

Page 105: Data Science with Spark

Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (homework: explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters for business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSSE is the minimum.

Page 106: Data Science with Spark

Recommendation Engine

5:05

Page 107: Data Science with Spark

Recommendation & Personalization - Spark

Automated analytics: let the data tell the story; feature learning, AI, deep learning
Learning models: fit parameters as more data arrives
Dynamic models: model selection based on context

o Knowledge-based
o Demographic-based
o Content-based
o Collaborative filtering
  • Item-based
  • User-based
  • Latent-factor-based

Signals: user rating; purchased; looked/not purchased

Spark implements the user-based ALS collaborative filtering.

Ref: ALS: Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu (AT&T Labs, Florham Park, NJ), Y. Koren, C. Volinsky
ALS-WR: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan

Page 108: Data Science with Spark

Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
Page 109: Data Science with Spark

Read & Parse

Page 110: Data Science with Spark

Split & Train

Page 111: Data Science with Spark

Evaluate

Page 112: Data Science with Spark

Epilogue
o We explored interesting APIs in Spark
o ALS collaborative filtering
o RDD operations
  • Join (hash join)
  • In-memory, grace, and recursive hash joins

http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx

Page 113: Data Science with Spark

Questions?

4:45

Page 114: Data Science with Spark

Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark
   http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/

Page 115: Data Science with Spark

Essential Reading List
o A few useful things to know about machine learning, by Pedro Domingos
  • http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms, by David H. Wolpert
  • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, by Y. Benjamini and Y. Hochberg
  • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
  • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, by James Faghmous
  • https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
  • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf

Page 116: Data Science with Spark

For your reading & viewing pleasure … an ordered list
① An Introduction to Statistical Learning
  • http://www-bcf.usc.edu/~gareth/ISL/
② ISL class, Stanford / Hastie / Tibshirani at their best: Statistical Learning
  • http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
  • https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
  • https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
  • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
  • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements of Statistical Learning
  • http://statweb.stanford.edu/~tibs/ElemStatLearn/

http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/

Page 117: Data Science with Spark

References:
o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
  • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013 / Strata 2014, Olivier Grisel
  • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
  • http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
  • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
o Thanks to Ana Crisan for the Titanic inset. Picture courtesy of http://emileeid.com/2012/02/11/titanic-3d-exclusive-posters/

Page 118: Data Science with Spark

The Beginning As The End

How did we do?

4:45

Page 119: Data Science with Spark