TRANSCRIPT
Linear Models for Regression
Agenda
• Introduction
• Cost Function and Minimization
• Gradient Descent
• Implementation
• Evaluating Regression Models
• Regularization
Copyright © 2018. Data Science Dojo
INTRODUCTION
Regression
Predict the length of a patient's stay in hospital
Forecast the cost of treatment
Predict the number of staff needed on a given day
Notation: Breast Cancer Dataset
𝑥₅¹
5: the patient is in the 5th row
1: the patient's diagnosis is in the 1st column
Notation: Breast Cancer Dataset
So how do we describe all the rows?
𝑥1 = [17.99, 10.38, 122.80]
𝑥2 = [20.57, 17.77, 132.90]
𝑥3 = [19.69, 21.25, 130.00]
Row 1
Row 2
Row 3
The breast cancer dataset records the physical properties of each tumor and its diagnosis.
Using this notation, we can describe all the columns of the dataset.
Notation: Breast Cancer Dataset
[Table: the feature columns 𝑥1, 𝑥2, 𝑥3 together form 𝑋; the diagnosis column is 𝑌]
Notation Summary
𝑥𝑖 – each row of features
𝑥𝑗 – each column of features
𝑋 – the set of all feature columns
𝑦𝑖 – each row of the target
𝑌 – the target column
𝑛 – number of rows in the dataset
𝑚 – number of columns in the dataset
COST FUNCTION AND GRADIENT DESCENT
What is a good regression line?
•Wind Speed=15 mph
•Ozone = ?
•Use the line that is somewhere in the middle
•How do we define "middle"?
ℎ𝜃(𝑥) = 𝜃0 + 𝜃1𝑥
Defining a line
How do we define a line in slope-intercept notation?
• 𝑦 = 𝑚𝑥 + 𝑏
In 𝜃 notation:
• ℎ𝜃(𝑥) = 𝜃1𝑥 + 𝜃0
𝑚 = slope → 𝜃1
𝑏 = intercept → 𝜃0
More Features
𝑦 𝑥1 𝑥2 𝑥3
ℎ𝜃(𝑥) = 𝜃0 + 𝜃1𝑥1 + 𝜃2𝑥2 + 𝜃3𝑥3
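As a sketch, the multi-feature hypothesis can be written as a small Python function; the name `h_theta` and the sample numbers are illustrative, not from the slides:

```python
def h_theta(theta, x):
    """Linear hypothesis: theta[0] is the intercept, theta[1:] are the feature weights."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

# theta = [1, 2, 3] applied to features [10, 20]: 1 + 2*10 + 3*20
print(h_theta([1.0, 2.0, 3.0], [10.0, 20.0]))  # 81.0
```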
Residuals (or "Errors")
Difference between hypothesis hθ(x) (predicted value) and true value (known target)
Cost Function
Minimize the ‘cost’ or ‘loss’ function – 𝐽(𝜃)
• Smaller for lower error
• Larger for higher error
𝐽(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)²
Cost Function
ℎ𝜃(𝑥) = 𝜃0 + 𝜃1𝑥        𝐽(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)²
[Plot: with 𝜃0 = 0 fixed, lines with 𝜃1 = 0.5, 1.0, and 2 on the left; the cost 𝐽 as a function of 𝜃1 on the right]
Each point on the parabola corresponds to a line on the graph on the left
Cost function in three dimensions
𝐽(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)²
[Surface plot of 𝐽(𝜃0, 𝜃1) over the (𝜃0, 𝜃1) plane]
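The cost 𝐽(𝜃) can be computed directly from its definition; a minimal Python sketch (the function name and toy data are illustrative, not from the slides):

```python
def cost(theta, X, y):
    """J(theta) = (1/2n) * sum over i of (h_theta(x_i) - y_i)^2."""
    n = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        pred = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x_i))
        total += (pred - y_i) ** 2
    return total / (2 * n)

# A perfect fit on y = 1 + 2x has zero cost.
print(cost([1.0, 2.0], [[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0]))  # 0.0
```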
How do we find the minimum of the cost function?
Maximum/Minimum Problem
Find two non-negative numbers whose sum is 9 and so that
the product of one number and the square of the other number
is a maximum.
Solution (1/2)
The sum of the numbers is 9:
9 = x + y
The product to maximize is:
P = x y² = x (9 − x)²
Solution (2/2)
Using the product rule and chain rule from Calculus 101:
P′ = x (2)(9 − x)(−1) + (1)(9 − x)²
   = (9 − x)[−2x + (9 − x)]
   = (9 − x)(9 − 3x)
   = 3(9 − x)(3 − x)
   = 0
x = 9 or x = 3; since x = 9 gives P = 0, the maximum is at x = 3 (y = 6, P = 108)
Maximum Problem
There are 50 apple trees in an orchard.
Each tree produces 800 apples. For each additional tree planted in the orchard, the apple output per tree drops by 10 apples.
Question: How many additional trees should be planted in the existing orchard in order to maximize the apple output of the orchard?
Solution
A = (50 + t) × (800 − 10t)
A = 40,000 + 300t − 10t²
Solve A′ = −20t + 300 = 0 to find the maximum: t = 15
Adding 15 trees will maximize apple production
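The calculus result above can be double-checked numerically; a brute-force sketch over integer tree counts:

```python
# A(t) = (50 + t) * (800 - 10t); search t = 0..79 for the maximizer.
best_t = max(range(0, 80), key=lambda t: (50 + t) * (800 - 10 * t))
print(best_t)                                # 15
print((50 + best_t) * (800 - 10 * best_t))   # 65 * 650 = 42250
```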
Gradient Descent
•Goal : minimize 𝐽(𝜃)
•Start with some initial 𝜃 and then perform an update on each 𝜃𝑗 in turn:
•Repeat until 𝜃 converges
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
Gradient Descent
• 𝛼 is known as the learning rate; set by user
• Each time the algorithm takes a step in the direction of the steepest descent and 𝐽 𝜃 decreases.
• 𝛼 determines how quickly or slowly the algorithm will converge to a solution
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
Intuition
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
[Plot of 𝐽 against 𝜃ⱼ: a positive gradient moves 𝜃ⱼ left and a negative gradient moves it right, so successive iterates 𝜃ⱼ⁽ᵏ⁾, 𝜃ⱼ⁽ᵏ⁺¹⁾, … step toward the minimum]
Effect of High Learning Rate: Large 𝛼
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
[Plot: with a large 𝛼 the steps overshoot the minimum and can oscillate or diverge]
Effect of Low Learning Rate: Small 𝛼
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
[Plot: with a small 𝛼 the steps are tiny and convergence is slow]
Gradient Descent Implementation
When do we stop updating?
• When 𝜃ⱼ⁽ᵏ⁺¹⁾ is close to 𝜃ⱼ⁽ᵏ⁾
• When 𝐽(𝜃⁽ᵏ⁺¹⁾) is close to 𝐽(𝜃⁽ᵏ⁾) [the error no longer changes]
Batch Gradient Descent
•How do we incorporate all our data?
•Loop!
For j from 0 to m:
•ℎ𝜃 is updated only once the loop has completed
•Weaknesses?
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 (1/𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ) 𝑥ⱼ⁽ⁱ⁾
This is 𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ with the partial derivative written out; each 𝜃ⱼ represents one feature.
Batch Gradient Descent
•Loop!
For j from 0 to m:
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 (1/𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ) 𝑥ⱼ⁽ⁱ⁾
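A minimal Python sketch of the batch update above for a single feature; the function name, learning rate, and toy data are illustrative, not from the slides:

```python
def batch_gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch GD for h_theta(x) = theta0 + theta1 * x."""
    n = len(y)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        preds = [t0 + t1 * x for x in X]
        # Average the gradient over the whole dataset before updating.
        g0 = sum(p - yi for p, yi in zip(preds, y)) / n
        g1 = sum((p - yi) * x for p, yi, x in zip(preds, y, X)) / n
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
    return t0, t1

# Data generated from y = 1 + 2x, so GD should recover theta = (1, 2).
t0, t1 = batch_gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(t0, 3), round(t1, 3))  # 1.0 2.0
```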
Stochastic Gradient Descent
• Consider an alternative approach:
• ℎ𝜃 is updated when inner loop is complete
• If the training set is big, converges quicker than batch
• May oscillate around a minimum of 𝐽(𝜃) and never converge
for i from 1 to n:
    for j from 0 to m:
        𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ) 𝑥ⱼ⁽ⁱ⁾
* We now take one observation at a time as a sample, instead of averaging across all observations
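The stochastic loop above can be sketched the same way; each update uses one sample. Names, learning rate, and toy data are illustrative, not from the slides:

```python
import random

def sgd(X, y, alpha=0.05, epochs=300, seed=0):
    """Stochastic GD for h_theta(x) = theta0 + theta1 * x."""
    rng = random.Random(seed)
    t0, t1 = 0.0, 0.0
    idx = list(range(len(y)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the samples in a random order
        for i in idx:
            err = (t0 + t1 * X[i]) - y[i]   # residual for sample i only
            t0 -= alpha * err               # theta_0 update (x_0 = 1)
            t1 -= alpha * err * X[i]        # theta_1 update
    return t0, t1

# Data generated from y = 1 + 2x, so SGD should approach theta = (1, 2).
t0, t1 = sgd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(t0, 1), round(t1, 1))  # 1.0 2.0
```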
Batch vs. Stochastic
Which is the better to use? It depends.
                           Batch Gradient Descent           Stochastic Gradient Descent
Function                   Updates hypothesis by scanning   Updates hypothesis by scanning
                           the whole dataset                one training sample at a time
Rate of convergence        Slow                             Fast (but may oscillate at the minimum)
Appropriate dataset size   Small                            Large
EVALUATING REGRESSION MODELS
Evaluation Metrics for Regression
•Mean Absolute Error (MAE)
•Root-Mean-Square Error (RMSE)• Root-Mean-Square Deviation
•Coefficient of Determination (R2)
Mean Absolute Error
• Mean of the absolute residuals
• A "pure" measure of error
𝑀𝐴𝐸(𝜃) = (1/𝑛) ∑ᵢ₌₁ⁿ |ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ|
Mean Absolute Error - Example
𝑦 = 36, 19, 34, 6, 1, 45
ℎ𝜃 𝑥 = 27,−2.6, 13, −7.3, −2.6, 48
|ℎ𝜃(𝑥) − 𝑦| = 9, 21.6, 21, 13.3, 3.6, 3
𝑀𝐴𝐸(𝜃) = 71.5 / 6 = 11.9
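The example above can be reproduced in a few lines of Python (the function name is illustrative):

```python
def mae(preds, y):
    """Mean Absolute Error: average of |h_theta(x_i) - y_i|."""
    return sum(abs(p - yi) for p, yi in zip(preds, y)) / len(y)

y     = [36, 19, 34, 6, 1, 45]
preds = [27, -2.6, 13, -7.3, -2.6, 48]
print(round(mae(preds, y), 1))  # 11.9
```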
Root-Mean-Square Error
•Square root of mean of squared residuals
•Penalizes large errors more than small
•Good measure to use to accentuate outliers
𝑅𝑀𝑆𝐸(𝜃) = √( (1/𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² )
RMSE - Example
𝑦 = 36, 19, 34, 6, 1, 45
ℎ𝜃 𝑥 = 27,−2.6, 13, −7.3, −2.6, 48
(ℎ𝜃(𝑥) − 𝑦)² = 81, 467, 441, 177, 13, 9
𝑅𝑀𝑆𝐸(𝜃) = √(1187 / 6) = 14.1
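The same example computed in Python (function name illustrative); note the square root that distinguishes RMSE from the mean squared error:

```python
import math

def rmse(preds, y):
    """Root-Mean-Square Error: sqrt of the mean squared residual."""
    return math.sqrt(sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y))

y     = [36, 19, 34, 6, 1, 45]
preds = [27, -2.6, 13, -7.3, -2.6, 48]
print(round(rmse(preds, y), 1))  # 14.1
```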
Coefficient of Determination (R2)
𝑅² = 1 − 𝑆𝑆ᵣₑₛ / 𝑆𝑆ₜₒₜ
where
𝑆𝑆ᵣₑₛ = ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² – sum of squared residuals (i.e., total squared error)
𝑆𝑆ₜₒₜ = ∑ᵢ₌₁ⁿ (𝑦ᵢ − ȳ)² – sum of squared differences from the mean (i.e., total variation in the dataset)
Result: a measure of how well the model explains the data
• "Fraction of the variation in the data explained by the model"
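A minimal Python sketch of the definition (function name illustrative):

```python
def r_squared(preds, y):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((p - yi) ** 2 for p, yi in zip(preds, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# A perfect model explains all the variation (R^2 = 1).
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```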
REGULARIZATION
Overfitting
[Three plots of Price vs. Size:]
𝜃0 + 𝜃1𝑥  (underfit)
𝜃0 + 𝜃1𝑥 + 𝜃2𝑥²  (good fit)
𝜃0 + 𝜃1𝑥 + 𝜃2𝑥² + 𝜃3𝑥³ + 𝜃4𝑥⁴  (overfit)
Intuition
𝐽′(𝜃) = 𝐽(𝜃) + Penalty
[Plots of Price vs. Size: the quartic ℎ𝜃(𝑥) = 𝜃0 + 𝜃1𝑥 + 𝜃2𝑥² + 𝜃3𝑥³ + 𝜃4𝑥⁴ overfits; ensuring the higher-order 𝜃ⱼ stay small smooths the fit]
Definitions
• Two most common methods:
  • L1 regularization ("lasso regression")
  • L2 regularization ("ridge regression", "weight decay")
𝐽_𝐿1(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ |𝜃ⱼ|
𝐽_𝐿2(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ 𝜃ⱼ²
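A sketch of the two penalized costs; the function name and the `penalty` switch are illustrative, not from the slides. The intercept 𝜃0 is left unpenalized, matching the penalty sums starting at j = 1:

```python
def regularized_cost(theta, X, y, lam, penalty="l2"):
    """J(theta) plus an L1 or L2 penalty on theta_1..theta_m."""
    n = len(y)
    sse = 0.0
    for x_i, y_i in zip(X, y):
        pred = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x_i))
        sse += (pred - y_i) ** 2
    if penalty == "l2":
        reg = lam * sum(t ** 2 for t in theta[1:])   # ridge
    else:
        reg = lam * sum(abs(t) for t in theta[1:])   # lasso
    return sse / (2 * n) + reg

# Perfect fit on y = 1 + 2x, so only the penalty remains: 0.5 * 2^2 = 2.0
print(regularized_cost([1.0, 2.0], [[0.0], [1.0]], [1.0, 3.0], lam=0.5))  # 2.0
```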
Regularized Regression
•Find the best fit
•Keep the 𝜃𝑗 terms as small as possible.
• 𝜆 is a user-set parameter which controls the trade-off
𝐽_𝐿1(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ |𝜃ⱼ|
𝐽_𝐿2(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ 𝜃ⱼ²
Regularized Regression
• The size of 𝜆 is important:
  • 𝜆 too high ⇒ no fitting
  • 𝜆 too low ⇒ no regularization
𝐽_𝐿1(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ |𝜃ⱼ|
𝐽_𝐿2(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ 𝜃ⱼ²
QUESTIONS
Unsupervised Learning and K-Means Clustering
Unsupervised Learning
• Trying to find hidden structure in unlabeled data
• No error or reward signal to evaluate a potential solution. No need to pick a response class.
• Common techniques: K-Means clustering, hierarchical clustering, hidden Markov models, etc.
Choose Number of Clusters
Example 1 (domain knowledge / practicalities): Clothing sizes
• Tailor-made for each person is too expensive
• One-size-fits-all does not work!
• Group people of similar sizes together to make "small", "medium", and "large" t-shirts
Choose Number of Clusters
Example 2 (via evaluation): Patient segmentation
• Subdivide patients into distinct subsets based on disease characteristics
• Any subset may conceivably be selected as a segment and then be targeted with care models and intervention programs tailored to its needs
K-Means Clustering
• Partitions data points into similarity clusters
• Unsupervised technique: there is no partitioning into a learning or a test set in unsupervised learning
• Useful in grouping observations
• Only works for numeric data
Data Preparation
• Transform categorical variables into numeric
• Standardize
• Reduce dimensionality
Age Pclass.1 Pclass.2 Pclass.3 Sex.female Sex.male
19 0 1 0 0 1
28 1 0 0 1 0
64 0 0 1 0 1
Often called "dummy variables" or "one-hot encoding"
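A minimal pure-Python sketch of one-hot encoding like the table above; real pipelines would typically use a library helper (e.g. pandas' dummy-variable support), and the function name here is illustrative:

```python
def one_hot(values):
    """One-hot encode a categorical column into 0/1 dummy columns."""
    categories = sorted(set(values))  # fixed, alphabetical column order
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(["male", "female", "male"]))
# [[0, 1], [1, 0], [0, 1]]  (columns ordered: female, male)
```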
Euclidean Distance
Used to determine intra- and inter-cluster similarity: the distance between points (𝑥1, 𝑦1) and (𝑥2, 𝑦2) is √((𝑥2 − 𝑥1)² + (𝑦2 − 𝑦1)²)
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized
K-Means Clustering (1/2)
K-Means Clustering (2/2)
The positions of the cluster centers are determined by the mean of all the points in the cluster.
K-Means Clustering
K=3
K-Means Clustering Algorithm
Suppose a set of data points {x1, x2, …, xn}:
• Step 0: Decide the number of clusters, K
• Step 1: Place K centroids c1, c2, …, cK at random locations
• Step 2: Repeat until convergence:
    for each point xi, find the nearest centroid cj (e.g., by Euclidean distance) and assign xi to cluster j
    for each cluster j = 1..K, recompute centroid cj as the mean of all points xi assigned to cluster j in the previous step
• Step 3: Stop when none of the cluster assignments change
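The steps above (Lloyd's algorithm) can be sketched in pure Python; the function name and toy data are illustrative, not from the slides:

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Assign each point to its nearest centroid, move each centroid to the
    mean of its assigned points, and stop when the assignments no longer change."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    assignments = None
    for _ in range(max_iters):
        new_assignments = [min(range(len(centroids)),
                               key=lambda j: dist(p, centroids[j]))
                           for p in points]
        if new_assignments == assignments:  # Step 3: no change -> stop
            break
        assignments = new_assignments
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignments, centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```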
K-Means Clustering
• Minimizes aggregate intra-cluster distance
• Measures the squared distance from each point to the center of its cluster:
∑ⱼ₌₁ᴷ ∑ₓ∈𝑔ⱼ 𝐷(𝑐ⱼ, 𝑥)²
• Could converge to local minimum
• Different starting points can give very different results
• Run many times with random starting points
• Nearby points may not be assigned to the same cluster
• Strengths
  • Simple: easy to understand and to implement
  • Efficient: linear time, minimal storage
• Weaknesses
  • The mean must be well defined
  • The user needs to specify K
  • The algorithm is sensitive to outliers
K-Means Clustering
Finding K with Elbow Method
Option 1 – Percentage of variance explained as a function of the number of clusters.
Option 2 – Total of the squared distances from each cluster point to its center.
Goal – Choose a number of clusters so that adding another cluster doesn't give much better modelling of the data.
QUESTIONS
Big Data Engineering
• Introduction
• A key problem – machine learning at scale
• Distributed computing with Apache Hadoop & Hive
• Machine learning at scale with Apache Mahout
• Distributed computing v2.0 – Apache Spark
5 Vs of Big Data
• Volume – Data at rest: terabytes to exabytes of existing data to process
• Velocity – Data in motion: streaming data, milliseconds to seconds to respond
• Variety – Data in many forms: structured, unstructured, text, and multimedia
• Veracity – Data in doubt: uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, and model approximations
• Value – Data can have different value: not all bytes are created equal
▪ Goal: As data scientists we want cost-effective access to the raw materials for our data products!
MACHINE LEARNING AT SCALE
OSS R Limits
▪ Single core
▪ Single threaded
[Figure: models A, B, and C each confined to a single core of a quad-core laptop]
• Single core
• Single threaded
• All in memory (RAM)
• Vectors & matrices capped at 4,294,967,295 (2³² − 1) elements (rows) in the 32-bit version
OSS R Limits
OSS R Limits: RAM
• All in memory (RAM)
Laptop Example:
Max Data Limit = (Total RAM × 80%) − Normal RAM Usage
Max Data Limit = (5.9 GB × 80%) − 3.2 GB = ~1.52 GB
*R data frames actually bloat data files by ~3x
R Data Limit = ~1.52 GB ÷ 3 = ~506.7 MB
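The slide's arithmetic can be reproduced directly (using the slide's 1000 MB/GB rounding); variable names are illustrative:

```python
# Usable RAM minus baseline usage, then divided by R's ~3x data-frame bloat.
total_ram_gb, normal_usage_gb = 5.9, 3.2
max_data_gb = total_ram_gb * 0.80 - normal_usage_gb
r_data_limit_mb = max_data_gb / 3 * 1000

print(round(max_data_gb, 2))      # 1.52
print(round(r_data_limit_mb, 1))  # 506.7
```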
OSS R Limits: RAM
Azure’s VM with largest RAM*:
*Data collected 06/07/2017
24x7x52 Annual Cost: $116,938.44!
Max Data Limit = (2000 GB × 80%) − 1 GB = ~1600 GB
R Data Limit = ~1600 GB ÷ 3 = ~533.33 GB
Machine Learning Scaling
• Programs: Excel
• Programming: R, Python, SAS
• Cloud: Azure ML, AWS ML, Big ML, Cloud Virtual Machines
• Distributed: Hadoop, Spark, H2O, Microsoft R Server
This only gets us so far! Big data scale!
DISTRIBUTED COMPUTING WITH APACHE HADOOP
Turn Back The Clock, The Mainframe
• "Big Iron"
• Backbone of computing for decades
• Still widely used
• "Scale-up" model of shared computing
• Core platform is cost effective; the ecosystem is not (e.g., software licensing)
• The original VM host!
Distributed Computing
Cloud Computing
• Conceptually – a combination of mainframe and distributed computing.
• VM hosts are now the “Big Iron”.
• Many VMs work together to distribute workloads.
• Some workloads on dedicated HW (e.g., SAP HANA).
Scaling Computational Power
Old Scaling:
▪ Vertical scaling, scaling UP
▪ High performance computers
New Scaling:
▪ Horizontal scaling, scaling OUT
▪ Commodity hardware, distributed
What is Hadoop?
• OSS Platform for distributed computing over Internet-scale data.
• Originally built at Yahoo!
• Implementation of ideas (e.g., MapReduce) published by Google.
• The de facto standard big data platform.
• Named after a stuffed animal belonging to Doug Cutting’s son.
Hadoop at Base
A distributed batch processing engine for big data: Storage + Compute
HDFS & MapReduce
60gb of Tweets
1 Computer
Processing: 30 hours
60gb
HDFS & MapReduce
60gb of Tweets
Processing: 15 hours
30gb
2 Computers
HDFS & MapReduce
60 Gb of Tweets
Processing: 10 hours
20Gb
3 Computers
Most Cases, Linear Scaling Of Processing Power
Number of Computers Processing Time (hours)
1 30
2 15
3 10
4 7.5
5 6
6 5
7 4.29
8 3.75
9 3.33
Head Node (NameNode) and Data Nodes
If dogs were servers…
HDFS Partitioning
[Figure: a file split into Partition 1, Partition 2, and Partition 3, stored across DataNode 1, DataNode 2, and DataNode 3]
HDFS Redundancy
[Figure: blocks B1, B2, G1, G2, O1, and O2 are each replicated across multiple DataNodes, so no single node failure loses data]
MapReduce – Sandwich Analogy
Limitations with MapReduce
• A lot of code to perform even the simplest task
• Slow
• Troubleshooting multiple computers
• Good devs are scarce
• Expensive certifications
Hive
• Abstraction built on top of MapReduce & HDFS.
• Makes Hadoop look like an RDBMS (e.g., coding in SQL).
• Developed by Facebook to democratize Hadoop.
• Applies structure to data at runtime (“schema on read”).
Hive Jobs
HiveQL Statement → Translation & Conversion → MapReduce Job
Word Count Revisited
Pages of MapReduce code vs. a few lines of HiveQL:

SELECT word, COUNT(*) AS word_count
FROM words
GROUP BY word;
Caution
SELECT * FROM ANYTHING: This brings back everything. Everything
doesn’t fit on a single computer.
JOIN: Join will take hours or days to perform and eat up all cluster
bandwidth for everyone else trying to use it in the queue.
ORDER BY: Sorting is very computationally expensive.
Sub Queries: A sub query essentially creates a secondary table, which will be huge in Hive.
Interactivity: SQL in a DBMS feels interactive because it is almost instantaneous; Hive queries run as batch jobs and are not.
Hadoop Implementations
HDInsight
Hadoop in Azure
• Compute: HDInsight
• Storage (HDFS layer): Blob Storage, Azure Data Lake Store
Mahout
• Distributed Machine Learning platform.
• Built on top of MapReduce and HDFS.
• Script-based and command line interfaces.
• R-like language implementation.
Distributed Random Forest
[Figure: HDFS partitioning – Partitions 1, 2, and 3 of the data spread across Data Nodes 1, 2, and 3]
Distributed Random Forest
[Figure: a data shuffle across Data Nodes 1, 2, and 3]
Distributed Random Forest
[Figure: Nodes A, B, and C each train a decision tree (A, B, C) on their own bootstrap sample (Bags 1, 2, and 3)]
Processing Times - Machine Learning
• Data cleaning: hours
• Training: days (the bottleneck)
• Prediction: milliseconds
• Large-scale systems are only needed for training
• Phones can use models output by Mahout to predict on new data
• After a model is trained, save it to any I/O file type and reload it where you want
Distributed Computing v2.0 – Apache Spark
What is Spark?
• “A fast and general engine for large-scale data processing.”
• Designed to incorporate the goodness of Hadoop and address Hadoop’s shortcomings.
• Can complement Hadoop via integration with both HDFS and Hive.
Why Spark? Improved Performance!
▪ Up to 10x faster than Hadoop working with data from disk.*
▪ Up to 100x faster working with data stored in memory!*
* benchmark is without Apache Yarn
Daytona GraySort Contest: Sort 100 TB of data!
Previous World Record:• Method: Hadoop• Yahoo!• 72 Minutes• 2100 Nodes
2014:• Method: Spark• Databricks• 23 Minutes• 206 Nodes
3x faster on 10x fewer machines!
Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
Big Data, Faster!
Conceptual Architecture
Spark and Hadoop
[Stack diagram: HDFS and YARN at the base; MapReduce, Hive, and the Java API on the Hadoop side; Spark SQL, Spark Streaming, and MLlib on the Spark side]
▪ Spark can be deployed on a Hadoop cluster and share cluster resources via YARN.
▪ Spark, however, does not require Hadoop!
QUESTIONS
Interpreting Findings from Machine Learning
Outline
• Metrics for a machine learning problem
• Examples in health (Case Study)
• Interpretation of a metric
• Conclusion
Classification
• When the number of possible predictions is finite, it is a classification problem, e.g.:
  • Benign vs. malignant tumor (2 possible predictions)
Right Metrics for Classification Problem
Following are the go-to metrics for evaluating any classification problem, including health care applications.
• Precision, recall, specificity, F1 score
• ROC, AUC
• Accuracy
• Log-Loss
• Root Mean Squared Error (useful when classification is done for predictions)
• Sometimes one metric alone does not clarify the whole picture; therefore, multiple metrics should be considered when evaluating classifiers
• We focus on evaluation metrics for binary classification (two possible classes); the concepts extend easily to M-ary classification (M possible classes)
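The per-class metrics above all come from the four confusion-matrix counts; a minimal sketch (the function name and the sample counts are illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity, F1, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    specificity = tn / (tn + fp)     # true-negative recognition rate
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, specificity, f1, accuracy

p, r, s, f1, acc = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(round(p, 2), round(r, 2), round(s, 2), round(f1, 2), round(acc, 2))
# 0.8 0.8 0.8 0.8 0.8
```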
Interpreting Metrics for Imbalanced Datasets
• So, which model is the better one?
Cost Difference Between FP (Type I) and FN (Type II) Errors
• Imagine datasets for two applications where positive examples represent:
  • Chronic schizophrenia patients having suicidal tendencies (dataset 1)
  • Type I on the Fitzpatrick scale for human skin color (dataset 2)
• Which model is better for dataset 1, and which model is better for dataset 2?
• For both models and both datasets, Log-Loss is not a useful metric, as it is the same (0.2) for both
Model Accuracy Precision Recall F1 Score AUC Log-Loss
Model 1 0.97 1 0.83 0.91 0.85 0.2
Model 2 0.94 0.75 1 0.86 0.8 0.2
Cost of FP Vs. FN for Dataset 1
• The FP are patients who are not chronic schizophrenic but are wrongly classified as schizophrenic
• The FN are patients who are chronic but are wrongly classified as −ve
• The costs associated with FP and FN are very skewed: an FN costs far more than an FP. An FP costs a few more tests, while an FN can cost a human life
• For dataset 1, the goal should be to minimize FN; that is, the model with the higher recall should be preferred
• Since Model 2 has the higher recall, it is preferred over Model 1
FP Vs. FN for Dataset 2
• It is assumed that the dataset 2 is for the Fitzpatrick scale for human skin color. There are 6 possible predictions in {Type I, Type II, …., Type VI}
• For simplicity, let’s assume we are interested in identifying the Type I from the rest
• Fitzpatrick scale is useful for skincare (and cosmetics industry)
FP Vs. FN for Dataset 2
• The FP for this dataset are patients who do not have a Type I skin tone but are wrongly classified as Type I
• The FN are patients who do have a Type I skin tone but are wrongly classified as another type
• The costs associated with FP and FN are similar
• For such datasets, the goal should be to minimize both FN and FP
• Hence, the model with higher accuracy together with higher AUC should be given more weight
• Therefore Model 1 is the better classifier here, where the cost of FP and FN errors is symmetric
Interpreting Metrics for More +ve Examples
Same Cost of Type I and Type II Errors
Model Accuracy Precision Recall F1 Score AUC
Model 1 0.94 1 0.67 0.8 0.8
Model 2 0.94 0.75 1 0.86 0.9
Model 3 0.94 0.93 1 0.964 0.8
Model 4 0.94 1 0.92 0.958 0.9
• AUC handles more +ve and more −ve examples the same way
• The F1 scores for Models 3 and 4 are very similar, because with a large number of +ve examples the F1 score deteriorates only with misclassification of +ve examples
• For imbalanced data, F1 score and AUC are important; the F1 score becomes more important when there are fewer +ve examples, which is quite common in healthcare
Which Classifier is Better?
Classifier Accuracy Precision Recall F1 Score AUC
Model 1 0.90 0.87 0.88 0.875 0.97
Model 2 0.91 0.92 0.83 0.873 0.96
• Classifiers
• Support Vector Machines (Model 1)
• KNN (Model 2)
• As per the accuracy, Model 2 is slightly better than the Model 1
• As discussed, accuracy’s calculation does not take into account the difference in the costs associated with the FP (Type I) and FN (Type II) errors
AUC
• The AUC for both classifiers is close to 1, which is desirable: the TPR is high and the FPR is low for both
• As per the AUC, Model 1 is slightly better
• Accuracy is slightly higher for one model and AUC slightly higher for the other, so accuracy and AUC alone are not decisive in this case
• So the question still remains: which model is the better one?
Recall, Precision, F1 Score
• FP (examples predicted as malignant that are actually benign) cost the patient a few more tests and, eventually, money
• FN (examples predicted as benign that are actually malignant) can cost a human life
• In this case the health care provider's preference is obvious: choose the model with the lower FN. FN is inversely related to recall
• The higher the recall, the better: a bigger proportion of malignant patients is identified correctly.
Answer: The Higher Recall
• Usually, higher recall and higher precision are both preferred. In this case, however, it is acceptable to flag some patients as FP, conduct a few more tests, and be more certain; this results in lower precision
• Both models have similar F1 scores, and Model 2's precision is higher than Model 1's
• It is the higher recall of Model 1 that makes it the superior classifier in this example.
Balanced Dataset Example: Early stroke detection and diagnosis
• Stroke is a frequent disease that has affected 500 million people worldwide
• It is the leading cause of death in China and the 5th leading cause in the US *
• Let's consider the F1 score, AUC, and Log-Loss as the possible evaluation metrics
* Artificial intelligence in healthcare: past, present and future (https://svn.bmj.com/content/2/4/230)
Models F1 score AUC Log-Loss
Model 1 0.88 0.94 0.28
Model 2 0.97 0.98 0.6
Interpreting findings for the case of balanced dataset
• Model 2 has the better F1 score and AUC
• However, as per the Log-Loss, Model 1 is the better one
• Though the data is balanced, the cost of a Type I error differs from that of a Type II error, which gives more credence to the F1 score and AUC than to the Log-Loss
• AUC can look reasonable even for inferior models, as seen here: 0.94 for the worse model and 0.98 for the better one
• It is therefore the higher F1 score of Model 2 that matters most in this scenario.
Case Study: The Henry Ford ExercIseTesting (FIT) Project
• Data and evaluation results are obtained from the journal article [1] published on April 18, 2018
• The clinical research dataset consists of 23,095 patients, collected by the FIT project to investigate relative performance of different classification techniques for predicting the individuals at risk of developing hypertension using medical records of cardiorespiratory fitness.
• The study compares the performance of six different ML models for predicting the individuals at risk of developing hypertension using cardiorespiratory fitness data.
• Using different validation methods, the RTF model on the dataset has shown the best performance (AUC = 0.93) which outperforms the models of the previous studies.
[1] Sakr S, Elshawi R, Ahmed A, Qureshi WT, Brawner C, et al. (2018) Using machine learning on cardiorespiratory fitness data for
predicting hypertension: The Henry Ford ExercIse Testing (FIT) Project. PLOS ONE 13(4): e0195344.
https://doi.org/10.1371/journal.pone.0195344
AUC Curves for the Different Machine Learning Models using SMOTE evaluated using 10-fold cross-validation
• RTF has the best ROC curve; its AUC appears to be the maximum
The Metrics
• The RTF (Random Tree Forest) model achieves the highest AUC (0.93), F1 score (86.70%), sensitivity (69.96%), and specificity (91.71%)
• What does the higher specificity for RTF mean?
  • Remember that specificity is the true-negative recognition rate: TN/(TN+FP)
  • The higher the specificity, the lower the FP
  • Hence fewer patients are tested further
• What does the higher sensitivity/recall for RTF mean?
  • Fewer FN, and hence fewer +ve patients sneak through the ML paradigm
A Note for ML Enthusiasts
• The results show that a more complex machine learning model does not necessarily achieve better prediction accuracy; simpler models can perform better in some cases.
• The results have also shown that it is critical to carefully explore and evaluate the performance of the machine learning models using various model evaluation methods as the prediction accuracy can significantly differ.
When to Use a Particular Metric?
• For balanced class accuracy is a good metric
• AUC is a good metric for balanced data, however, it is more effective for imbalanced dataset
• If there is a dominant class (imbalanced dataset) then give more importance to AUC and F1 score
• If the goal is to classify the smaller class better, irrespective of it being a +ve or a −ve class, then AUC is a good measure
• The F1 score is important when the +ve class is small
• If the application requires the minimum FN, go for higher recall
• If the application requires the minimum FP, go for higher precision
• Higher recall/sensitivity is better for identifying the +ve examples
• Higher specificity is better for identifying the −ve examples
• Though rarely used, Log-Loss measures the absolute probabilistic difference and is important in some applications
• RMSE is useful when classification algorithms are evaluated for predictions
Conclusion
• One metric is good in one scenario and may not work for another
• Metrics should be chosen based on the balance of the data
• If +ve examples are fewer or are more important to classify then it gives credence to certain metrics over others
• In general, it is a better idea to calculate more metrics and then decide in favor of a particular model
• Generally, a good ML algorithm strikes a good balance between the Precision and Recall.
• If the difference in costs of Type I and Type II errors is large then Precision and Recall are the preferred metrics
• For health care applications the best scenario is when both recall and specificity are maximum
Implementation of Policy Recommendations: Performance-Based Financing in Health
Using Supervised Learning to Select Audit Targets in Performance-Based Financing in Health: An Example from Zambia
By Dhruv Grover, Sebastian Bauhoff, and Jed Friedman
Setting the context
• Zambia operated a pilot project of performance-based financing of public health centers from 2012 to 2014
• Public health centers are paid for the quantity and quality of the services they deliver
• Public health centers (covering 11% of Zambia's population) in 10 rural districts participated
The Program Improved Certain Indicators But…
• 42% of facilities over-reported at least once in the four quarters measured
• Financing per service incentivized both provision of the service and over-reporting
• Payment for over-reported services undermines the incentive for delivery of the service and is a waste of public resources => we need to minimize over-reporting
What Measures Were in Place to Reduce Over-reporting?
• Dedicated district steering committees conducted continuous internal verification by reconciling the facility-reported information with the paper-based evidence
• An independent third party conducted a one-off external verification process after two years of program operation (costing $22.5k)
They Need to Target the Verification So it Identifies Over-reporting Facilities But Isn’t Cost Prohibitive
• The aim of the external verification is to minimize over-reporting whilst minimizing verification costs
• You could independently verify every single facility => this would completely eliminate over-reporting BUT this is probably cost prohibitive
• You could verify no facilities at all, BUT over-reporting would likely be substantial and may worsen over time as practitioners realize they can take advantage of the lack of verification
How to Identify Facilities That are Likely to Have Over-reported?
• It is possible to identify over-reporting using random sampling or machine learning
• They predict over-reporting, defined as:
• 1 if the difference between the reported and verified data is > 10% of the reported value
• 0 otherwise
• Using the following input features:
• Reported and verified values for the 9 quantity measures rewarded in the PBF program
• Control variables
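A minimal sketch of the labeling rule above. The facility values are fabricated; only the >10% threshold comes from the slide:

```python
# Sketch: constructing the over-reporting label described above.
def over_reported(reported, verified, threshold=0.10):
    """1 if reported exceeds verified by more than 10% of the reported value, else 0."""
    return 1 if (reported - verified) > threshold * reported else 0

# Hypothetical facility records (not real PBF data)
facilities = [
    {"reported": 120, "verified": 100},  # gap of 20 > 12 (10% of 120) -> over-reported
    {"reported": 100, "verified": 95},   # gap of 5 <= 10               -> fine
    {"reported": 80,  "verified": 80},   # exact match                  -> fine
]
labels = [over_reported(f["reported"], f["verified"]) for f in facilities]
print(labels)  # [1, 0, 0]
```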
Factors to Determine Choice of ML Technique
1. What is the size of the training dataset?
2. Can features be treated as independent variables?
3. Will additional training data become available in the future and need to be incorporated into the model?
4. Is the data linearly separable?
5. Is overfitting expected to be a problem?
6. Are there any speed, performance, or memory-usage requirements?
The Algorithms Learn Patterns From the Inputs That Indicate a Facility Is at Risk of Over-reporting
• The patterns are learned on data from the first quarter only
• The models (algorithm + data + parameters) are measured on how well they correctly identify facilities that over-reported in the first quarter
• The models then apply this learning to predict the risk of over-reporting for other facilities on unseen data in subsequent quarters
• On this occasion, random forest outperformed all the other algorithms on all 5 metrics
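A sketch of this setup using scikit-learn's RandomForestClassifier on synthetic data. The 9 features and the labels below are fabricated stand-ins, not the actual Zambia PBF data, and the AUC printed is for the toy data only:

```python
# Sketch: train a random forest on synthetic "facility" data and score it with AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))   # stand-in for the 9 rewarded quantity measures
# Fabricated label: "over-reporting" driven by the first two features plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # threshold-free metric
print("test AUC:", round(auc, 3))
```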
How Does This Change the Targeting of Verification?
• The verifiers can send audit teams to verify data at those facilities that have the highest risk of over-reporting
• They save roughly $800 per clinic that did not need to be inspected
• The verifiers need to periodically collect another random sample to re-train the model, so that facilities identified as low-risk do not take advantage of the reduced monitoring and begin over-reporting
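A minimal sketch of risk-based audit targeting. The facility names and risk scores are made up (in practice they would come from the trained model's predicted probabilities); only the roughly $800-per-skipped-visit figure comes from the slide:

```python
# Sketch: send audit teams to the facilities with the highest predicted risk.
risk = {"facility_A": 0.92, "facility_B": 0.15, "facility_C": 0.64, "facility_D": 0.08}

budget = 2  # number of audits we can afford this quarter
targets = sorted(risk, key=risk.get, reverse=True)[:budget]
print(targets)  # ['facility_A', 'facility_C']

# Rough saving vs. auditing everyone, at ~$800 per skipped visit
saving = 800 * (len(risk) - budget)
print(f"estimated saving: ${saving}")  # estimated saving: $1600
```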