introduction to machine learning with spark
TRANSCRIPT
![Page 1: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/1.jpg)
Introduction to Machine Learning in Spark
Machine learning at Scale
https://github.com/phatak-dev/introduction_to_ml_with_spark
![Page 2: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/2.jpg)
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consult in Hadoop, Spark and Scala
● www.madhukaraphatak.com
![Page 3: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/3.jpg)
Agenda● Introduction to Machine learning● Machine learning in Big data● Understanding mathematics ● Vector manipulation in Scala/Spark● Implementing ML in Spark RDD● Using MLLib● ML beyond MLLib
![Page 4: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/4.jpg)
3 D’s of big data processing● Data Scientist
Models simple world view from chaotic complex real world data
● Data EngineerImplements simple model on complex toolset and data
● Data ArtistExplains complex results in simple visualizations
![Page 5: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/5.jpg)
Introduction to Machine Learning
![Page 6: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/6.jpg)
What is machine learning?
A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if it’s performance on T, as measured P, improves
with experience E - Tom Mitchell (1998)
![Page 7: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/7.jpg)
How human learn?● We repeat the same task (T) over and over again to
gain experience (E)● This doing over and over again same task is known as
practice● With practice and experience, we get better at that task● Once we achieve some level,we have learnt● Everything in life is learnt
![Page 8: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/8.jpg)
Learning example - Music instrument● We practice each day the instrument with same song
again and again● We will be pretty bad to start with, but with practice we
improve● Our teacher will measure performance to understand
did we done progress● Once your teacher approves, you are a player
![Page 9: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/9.jpg)
Why it’s hard to learn?● Practicing over and over same task is often boring and
frustrating● Without proper measurement, we may be stuck at same
level without making much progress ● Making progress initially is pretty easy but it’s starts to
get harder and harder● Sometime natural talent also dicates cap on how best
we can be
![Page 10: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/10.jpg)
How machine learn?● You start with an assumption about solution for given
problem● Make machine go over data and verify the assumption● Run the program over and over again on same data
updating assumption on the way● Measure improvement after each round, adjust
accordingly● Once you have no more assumption changing, stop
running
![Page 11: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/11.jpg)
Brute force machine learningProblem : Find relation between x , y and z values , given data set x y z 1 2 5 2 3 6 2 4 7 What human approach to this?
![Page 12: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/12.jpg)
ML approach● Let’s assume that x,y,z are connected as below z = c0 + c1x + c2y where c0, c1, c2 factors which we need to learn● Let’s start with following values
c0 = 0 c1 = 1 c2 = 1 ● Find out the error for each row● Adjust the values for c0,c1,c2 from error● Repeat
![Page 13: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/13.jpg)
ML history● Sub branch of AI● Impossible to model brain because it's not an ideal
world● Most of the time we don’t need all complexity of real
world ex : ideal gas equation from thermodynamics● Approximation of real world is more than enough to
solve real problems
![Page 14: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/14.jpg)
ML and Data● The quality of a model of ML is determined by
○ learning algorithm○ Amount of data it has seen
● More the data, a simple algorithm can do better than sophisticated algorithm with less data
● So for long ML was focused coming up with better algorithms as amount of data was limited
● Then we landed on big data
![Page 15: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/15.jpg)
Machine learning in Big data
![Page 16: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/16.jpg)
ML and Big data● Ability to learn on large corpus of data is a real boon for
ML● Even simplistic ML models shine when they are trained
on huge amount of data● With big data toolsets, a wide variety of ML application
have started to emerge other than academic specifics one
● Big data democratising ML for general public
![Page 17: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/17.jpg)
ML vs Big data Machine Learning Big data
Optimized for iterative computations Optimized for single pass computations
Maintains state between stages Stateless
CPU intensive Data intensive
Vector/Matrix based or multiple rows/cols ata time
Single column/ row at time
![Page 18: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/18.jpg)
Machine learning in Hadoop● In Hadoop M/R each iteration of ML translates into
single M/R job.● Each of these jobs need to store data in HDFS which
creates a lot of overhead● Keeping state across jobs is not directly available in
M/R● Constant fight between quality of result vs performance
![Page 19: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/19.jpg)
Machine learning in Spark● Spark is first general purpose big data processing
engine build for ML from day one● The initial design in Spark was driven by ML
optimization○ Caching - For running on data multiple times○ Accumulator - To keep state across multiple
iterations in memory○ Good support for cpu intensive tasks with laziness
● One of the examples in Spark first version was of ML
![Page 20: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/20.jpg)
First ML Example in Spark
![Page 21: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/21.jpg)
Understanding Mathematics behind ML
![Page 22: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/22.jpg)
Major types of ML● Supervised learning
Training data contains both input vector and desired output. We also called it as labeled dataEx : Linear Regression, Logistic Regression
● Unsupervised learningTraining data sets without labels. Ex : K-means clustering
![Page 23: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/23.jpg)
Process of Supervised Learning Training Set
Learning Algorithm
Hypothesis/modelNew data Result
(Prediction/Classification)
![Page 24: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/24.jpg)
ML Terminology● Training Set
Set of data used to teach machine. In supervised learning both input vector and output will be available.
● Learning algorithm Algorithm which consumes the training set to infer relation between input vectors which optimises for known output labels
● Model Learnt function of input parameters
![Page 25: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/25.jpg)
Linear regression● A learning algorithm to learn relation between
dependent variable y with one or more explanatory variable x which are connected by linear relation.
● Given multiple dimensional data, the hypothesis for linear regression looks as below
h(x) = c0 + c1x1 + c2x2 + ….● x1, x2 are the explanatory variables and y is the
dependent variable. h(x) is the hypothesis.● Linear regression goal is to learn c0,c1,c2
![Page 26: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/26.jpg)
Linear Regression Example● Price of the house is dependent on size of the house
● What will be price of the house if size of the house is 1200?
Size of the house(sft) Price(in Rs)
1000 5 lakh
2000 15 lakh
800 4 lakh
![Page 27: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/27.jpg)
Linear regression for housing data● Price of the house is dependent on size of the house● So in linear regression
y -> price of the housex -> size of the house
● Then model we have to learn is Price_of_the_house = c0+ c1* size_of_the_house
● We assume relationship is linear
![Page 28: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/28.jpg)
Curve fitting
![Page 29: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/29.jpg)
Choosing values ● Our model is
h(x) = c0 + c1x1 + c2x2 + ….● How to choose c0, c1, c2?● We want to choose c0,c1,c2 such way which gives us y
values which are close to the one in training set● But how we know they are close to y?● Cost function helps to find out they are close or not.
![Page 30: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/30.jpg)
Cost function● Cost function for linear regression is J(c) = (h(x) - y)^2 / 2m - know as squared error● J(c) is a function of c0,c1,c2 because cost changes
depending upon the different c1, c2● m is number of rows in training data● Goal of learning algorithm is to learn a model which
minimizes this error● How to minimize J(c)?
![Page 31: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/31.jpg)
Curve of cost function
![Page 32: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/32.jpg)
Minimising cost function● Start with some random value of c0, c1● Calculate the cost ● Keep on changing c0, c1 till you find the minimum cost● Gradient descent in one of the algorithm to find
minimum any mathematical function● Also known as convex optimizer● The name gradient comes from, use of gradient of the
function to decide which way to walk
![Page 33: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/33.jpg)
Gradient descent● The way to update c values in gradient descent c(j) = c(j) - alpha * derivative( J(c)) ● alpha is known as learning rate● For linear regression
○ cost function - J(c) = (h(x) - y)^2 / 2m○ Derivative is =
■ c0 - (h(x) - y) / m ■ c1,c2 .. - ((h(x) - y) * x ) / m
![Page 34: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/34.jpg)
Understanding Gradient Descent
![Page 35: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/35.jpg)
Linear regression algorithm● Start with a training set with x1,x2, x3 .. and y● Start with parameters c0,c1,c3 with random values● Start with a learning rate alpha● Then repeat the following update c0 = c0 - alpha * h(x) - y c1 = c1 - alpha * (h(x) - y) * x● Repeat this process till it converges
![Page 36: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/36.jpg)
Vectors in ML● All information in machine learning is represented using
vectors of numbers which represent features● Collection of vectors are represented as matrices● All the calculation in machine learning are expressed
using vector manipulation● So Understanding vector manipulation is very important● We will understand how Scala and Spark represents the
vector
![Page 37: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/37.jpg)
Vector manipulation in Scala/Spark
![Page 38: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/38.jpg)
Breeze - Vector library for Scala● Breeze is library for numerical processing for scala● Option to use fortran based native numeric library using
netlib-java● Spark uses breeze internally for vector manipulation in
MLLib library● Many data structures of MLLib are modeled around
breeze data structures● https://github.com/scalanlp/breeze
![Page 39: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/39.jpg)
Representing data in breeze vector● Two types of vectors
○ Dense vectors○ Sparse vectors
● All vectors in breeze are column vectors. Take transpose for row vector
● Normally row vectors are used to represent the data and column vectors for representing weights or coefficients of learning algorithm
● VectorExample.scala
![Page 40: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/40.jpg)
Vector/Matrix manipulation ● Multiplying vector from a constant
○ Multiply using netlib ○ Multiply using constant vector elementwise
MultiplyVectorByConstant.scala● Multiplying matrix from a constant
○ value○ vector
MultiplyMatrixByConstant.scala
![Page 41: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/41.jpg)
Dot product ● Dot product of two vectors A = [ a1, a2 , a3 ..] and B=
[b1,b2,b3 .. ] is A.B = sum ( a1*b1 + a2*b2 + a3*b3)● Two ways to implement in breeze
○ A.dot(B)○ Transpose(A) * B)
● DotProductExample.scala
![Page 42: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/42.jpg)
In-place computations● Creating a breeze vector/matrix is a costly operations● So if each transformation in our computation creates
new vector, it may hurt our performance● We can use in place operations, which can update
existing vectors rather than creating new vector● When we do many vector manipulations, in-place is
preferred for better performance● Ex : InPlaceExample.scala
![Page 43: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/43.jpg)
Representing data as RDD[Vector]● In Spark ML, each row is represented using vectors● Representing row in vector allows us to easily
manipulate them using earlier discussed vector manipulations
● We broadcast vector for efficiency● We can manipulate partition at a time using represent
them as the matrices● Manipulating as partition can give good performance● RDDVectorExample.scala
![Page 44: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/44.jpg)
Implementing Linear Regression in Spark
![Page 45: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/45.jpg)
Implementing LR in Spark● Represent that data as DataPoint which has
○ x - feature vector○ y - value to be predicted
● Use accumulator to keep track of the cost ● Use reduce to aggregate gradient across multiple rows● Uses gradient descent to work on this● Ex : LinearRegressionExample.scala
![Page 46: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/46.jpg)
Stochastic gradient descent● Use mini batch to do rather complete dataset● Using sampling we can achieve this● Mini batches help to speed up the gradient descent in
multiple level● By default batch size is 1.0● Ex : LRWithSGD.scala
![Page 47: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/47.jpg)
Using MLLib
![Page 48: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/48.jpg)
MLLib● Machine learning library shipped with standard
distribution of Spark● Supports popular machine learning algorithms like
Linear regression, Logistic regression, decision trees etc out of the box
● Every release new algorithms are added● Supports multiple optimization techniques SGD,
LBFGS
![Page 49: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/49.jpg)
Linear Regression in MLLib● org.apache.spark.mllib.linalg.DenseVector wraps
breeze vector for MLLib library● Use LabeledPoint to represent the data of a given row● Built in support for Linear regression with SGD● Ex :LinearRegressionInMLLib.scala
![Page 50: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/50.jpg)
LR for housing data● We are going to predict housing price using house size● Size and price are in different scale● Need to scale both of them to same scale● We use StandardScaler to scale RDD[Vector]● Scaled Data will be used for LinearRegression● Ex : LRForHousingData.scala
![Page 51: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/51.jpg)
Machine learning beyond MLLib● MLLib Pipelines API● MLLib feature framework● Sparkling water http://h2o.ai/product/sparkling-water/● SparkR● http://prediction.io/
![Page 52: Introduction to Machine Learning with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042600/587985f31a28ab6c358b65bf/html5/thumbnails/52.jpg)
References● https://www.coursera.org/learn/machine-learning
● https://github.com/zinniasystems/spark-ml-class
● https://github.com/scalanlp/breeze