TRANSCRIPT
Linear Models for Regression
Agenda
• Introduction
• Cost Function and Minimization
• Gradient Descent
• Implementation
• Evaluating Regression Models
• Regularization
Copyright © 2018. Data Science Dojo
INTRODUCTION
Regression
Predict the length of a patient's stay in hospital
Forecast the cost of treatment
Predict the number of staff needed on a given day
Notation: Breast Cancer Dataset
𝑥₅¹
5: the patient is in the 5th row
1: the patient's diagnosis is in the 1st column
Notation: Breast Cancer Dataset
So how do we describe all the rows?
𝑥1 = [17.99, 10.38, 122.80]
𝑥2 = [20.57, 17.77, 132.90]
𝑥3 = [19.69, 21.25, 130.00]
Row 1
Row 2
Row 3
The breast cancer dataset records the physical properties of each tumor and its diagnosis.
Using this notation, we can describe all the columns of the dataset.
Notation: Breast Cancer Dataset
[Table: the feature columns 𝑥1, 𝑥2, 𝑥3 together form 𝑋; the diagnosis column is 𝑌]
Notation Summary
𝑥𝑖 – each row of features
𝑥𝑗 – each column of features
𝑋 – the set of all feature columns
𝑦𝑖 – each row of the target
𝑌 – the target column
𝑛 – number of rows in the dataset
𝑚 – number of columns in the dataset
COST FUNCTION AND GRADIENT DESCENT
What is a good regression line?
•Wind Speed=15 mph
•Ozone = ?
•Use the line that is somewhere in the middle
•How do we define "middle"?
ℎ𝜃(𝑥) = 𝜃0 + 𝜃1𝑥
Defining a line
How do we define a line in slope-intercept notation?
• 𝑦 = 𝑚𝑥 + 𝑏
In 𝜃 notation:
• ℎ𝜃(𝑥) = 𝜃1𝑥 + 𝜃0
𝑚 = slope → 𝜃1
𝑏 = intercept → 𝜃0
More Features
𝑦 𝑥1 𝑥2 𝑥3
ℎ𝜃(𝑥) = 𝜃0 + 𝜃1𝑥1 + 𝜃2𝑥2 + 𝜃3𝑥3
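As a sketch, the multi-feature hypothesis can be written as a small Python function; the name `h_theta` and the sample numbers are illustrative, not from the slides:

```python
def h_theta(theta, x):
    """Linear hypothesis: theta[0] is the intercept, theta[1:] are the feature weights."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

# theta = [1, 2, 3] applied to features [10, 20]: 1 + 2*10 + 3*20
print(h_theta([1.0, 2.0, 3.0], [10.0, 20.0]))  # 81.0
```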
Residuals (or "Errors")
Difference between hypothesis hθ(x) (predicted value) and true value (known target)
Cost Function
Minimize the ‘cost’ or ‘loss’ function – 𝐽(𝜃)
• Smaller for lower error
• Larger for higher error
𝐽(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)²
Cost Function
ℎ𝜃(𝑥) = 𝜃0 + 𝜃1𝑥        𝐽(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)²
[Plot: with 𝜃0 = 0 fixed, lines with 𝜃1 = 0.5, 1.0, and 2 on the left; the cost 𝐽 as a function of 𝜃1 on the right]
Each point on the parabola corresponds to a line on the graph on the left
Cost function in three dimensions
𝐽(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)²
[Surface plot of 𝐽(𝜃0, 𝜃1) over the (𝜃0, 𝜃1) plane]
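The cost 𝐽(𝜃) can be computed directly from its definition; a minimal Python sketch (the function name and toy data are illustrative, not from the slides):

```python
def cost(theta, X, y):
    """J(theta) = (1/2n) * sum over i of (h_theta(x_i) - y_i)^2."""
    n = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        pred = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x_i))
        total += (pred - y_i) ** 2
    return total / (2 * n)

# A perfect fit on y = 1 + 2x has zero cost.
print(cost([1.0, 2.0], [[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0]))  # 0.0
```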
How do we find the minimum of the cost function?
Maximum/Minimum Problem
Find two non-negative numbers whose sum is 9 and so that
the product of one number and the square of the other number
is a maximum.
Solution (1/2)
The sum of the numbers is 9:
9 = x + y
The product to maximize is:
P = x y² = x (9 − x)²
Solution (2/2)
Using the product rule and chain rule from Calculus 101:
P′ = x (2)(9 − x)(−1) + (1)(9 − x)²
   = (9 − x)[−2x + (9 − x)]
   = (9 − x)(9 − 3x)
   = 3(9 − x)(3 − x)
   = 0
x = 9 or x = 3; since x = 9 gives P = 0, the maximum is at x = 3 (y = 6, P = 108)
Maximum Problem
There are 50 apple trees in an orchard.
Each tree produces 800 apples. For each additional tree planted in the orchard, the apple output per tree drops by 10 apples.
Question: How many additional trees should be planted in the existing orchard in order to maximize the apple output of the orchard?
Solution
A = (50 + t) × (800 − 10t)
A = 40,000 + 300t − 10t²
Solve A′ = −20t + 300 = 0 to find the maximum: t = 15
Adding 15 trees will maximize apple production
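The calculus result above can be double-checked numerically; a brute-force sketch over integer tree counts:

```python
# A(t) = (50 + t) * (800 - 10t); search t = 0..79 for the maximizer.
best_t = max(range(0, 80), key=lambda t: (50 + t) * (800 - 10 * t))
print(best_t)                                # 15
print((50 + best_t) * (800 - 10 * best_t))   # 65 * 650 = 42250
```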
Gradient Descent
•Goal : minimize 𝐽(𝜃)
•Start with some initial 𝜃 and then perform an update on each 𝜃𝑗 in turn:
•Repeat until 𝜃 converges
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
Gradient Descent
• 𝛼 is known as the learning rate; set by user
• Each time the algorithm takes a step in the direction of the steepest descent and 𝐽 𝜃 decreases.
• 𝛼 determines how quickly or slowly the algorithm will converge to a solution
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
Intuition
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
[Plot of 𝐽 against 𝜃ⱼ: a positive gradient moves 𝜃ⱼ left and a negative gradient moves it right, so successive iterates 𝜃ⱼ⁽ᵏ⁾, 𝜃ⱼ⁽ᵏ⁺¹⁾, … step toward the minimum]
Effect of High Learning Rate: Large 𝛼
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
[Plot: with a large 𝛼 the steps overshoot the minimum and can oscillate or diverge]
Effect of Low Learning Rate: Small 𝛼
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ
[Plot: with a small 𝛼 the steps are tiny and convergence is slow]
Gradient Descent Implementation
When do we stop updating?
• When 𝜃ⱼ⁽ᵏ⁺¹⁾ is close to 𝜃ⱼ⁽ᵏ⁾
• When 𝐽(𝜃⁽ᵏ⁺¹⁾) is close to 𝐽(𝜃⁽ᵏ⁾) [the error no longer changes]
Batch Gradient Descent
•How do we incorporate all our data?
•Loop!
For j from 0 to m:
•ℎ𝜃 is updated only once the loop has completed
•Weaknesses?
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 (1/𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ) 𝑥ⱼ⁽ⁱ⁾
This is 𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 ∂𝐽(𝜃⁽ᵏ⁾)/∂𝜃ⱼ with the partial derivative written out; each 𝜃ⱼ represents one feature.
Batch Gradient Descent
•Loop!
For j from 0 to m:
𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 (1/𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ) 𝑥ⱼ⁽ⁱ⁾
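A minimal Python sketch of the batch update above for a single feature; the function name, learning rate, and toy data are illustrative, not from the slides:

```python
def batch_gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch GD for h_theta(x) = theta0 + theta1 * x."""
    n = len(y)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        preds = [t0 + t1 * x for x in X]
        # Average the gradient over the whole dataset before updating.
        g0 = sum(p - yi for p, yi in zip(preds, y)) / n
        g1 = sum((p - yi) * x for p, yi, x in zip(preds, y, X)) / n
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
    return t0, t1

# Data generated from y = 1 + 2x, so GD should recover theta = (1, 2).
t0, t1 = batch_gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(t0, 3), round(t1, 3))  # 1.0 2.0
```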
Stochastic Gradient Descent
• Consider an alternative approach:
• ℎ𝜃 is updated when inner loop is complete
• If the training set is big, converges quicker than batch
• May oscillate around a minimum of 𝐽(𝜃) and never converge
for i from 1 to n:
    for j from 0 to m:
        𝜃ⱼ⁽ᵏ⁺¹⁾ ≔ 𝜃ⱼ⁽ᵏ⁾ − 𝛼 (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ) 𝑥ⱼ⁽ⁱ⁾
* We now take one observation at a time as a sample, instead of averaging across all observations
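The stochastic loop above can be sketched the same way; each update uses one sample. Names, learning rate, and toy data are illustrative, not from the slides:

```python
import random

def sgd(X, y, alpha=0.05, epochs=300, seed=0):
    """Stochastic GD for h_theta(x) = theta0 + theta1 * x."""
    rng = random.Random(seed)
    t0, t1 = 0.0, 0.0
    idx = list(range(len(y)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the samples in a random order
        for i in idx:
            err = (t0 + t1 * X[i]) - y[i]   # residual for sample i only
            t0 -= alpha * err               # theta_0 update (x_0 = 1)
            t1 -= alpha * err * X[i]        # theta_1 update
    return t0, t1

# Data generated from y = 1 + 2x, so SGD should approach theta = (1, 2).
t0, t1 = sgd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(t0, 1), round(t1, 1))  # 1.0 2.0
```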
Batch vs. Stochastic
Which is the better to use? It depends.
                           Batch Gradient Descent           Stochastic Gradient Descent
Function                   Updates hypothesis by scanning   Updates hypothesis by scanning
                           the whole dataset                one training sample at a time
Rate of convergence        Slow                             Fast (but may oscillate at the minimum)
Appropriate dataset size   Small                            Large
EVALUATING REGRESSION MODELS
Evaluation Metrics for Regression
•Mean Absolute Error (MAE)
•Root-Mean-Square Error (RMSE)• Root-Mean-Square Deviation
•Coefficient of Determination (R2)
Mean Absolute Error
• Mean of the absolute residuals
• A "pure" measure of error
𝑀𝐴𝐸(𝜃) = (1/𝑛) ∑ᵢ₌₁ⁿ |ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ|
Mean Absolute Error - Example
𝑦 = 36, 19, 34, 6, 1, 45
ℎ𝜃 𝑥 = 27,−2.6, 13, −7.3, −2.6, 48
|ℎ𝜃(𝑥) − 𝑦| = 9, 21.6, 21, 13.3, 3.6, 3
𝑀𝐴𝐸(𝜃) = 71.5 / 6 = 11.9
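The example above can be reproduced in a few lines of Python (the function name is illustrative):

```python
def mae(preds, y):
    """Mean Absolute Error: average of |h_theta(x_i) - y_i|."""
    return sum(abs(p - yi) for p, yi in zip(preds, y)) / len(y)

y     = [36, 19, 34, 6, 1, 45]
preds = [27, -2.6, 13, -7.3, -2.6, 48]
print(round(mae(preds, y), 1))  # 11.9
```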
Root-Mean-Square Error
•Square root of mean of squared residuals
•Penalizes large errors more than small
•Good measure to use to accentuate outliers
𝑅𝑀𝑆𝐸(𝜃) = √( (1/𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² )
RMSE - Example
𝑦 = 36, 19, 34, 6, 1, 45
ℎ𝜃 𝑥 = 27,−2.6, 13, −7.3, −2.6, 48
(ℎ𝜃(𝑥) − 𝑦)² = 81, 467, 441, 177, 13, 9
𝑅𝑀𝑆𝐸(𝜃) = √(1187 / 6) = 14.1
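The same example computed in Python (function name illustrative); note the square root that distinguishes RMSE from the mean squared error:

```python
import math

def rmse(preds, y):
    """Root-Mean-Square Error: sqrt of the mean squared residual."""
    return math.sqrt(sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y))

y     = [36, 19, 34, 6, 1, 45]
preds = [27, -2.6, 13, -7.3, -2.6, 48]
print(round(rmse(preds, y), 1))  # 14.1
```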
Coefficient of Determination (R2)
𝑅² = 1 − 𝑆𝑆ᵣₑₛ / 𝑆𝑆ₜₒₜ
where
𝑆𝑆ᵣₑₛ = ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² – sum of squared residuals (i.e., total squared error)
𝑆𝑆ₜₒₜ = ∑ᵢ₌₁ⁿ (𝑦ᵢ − ȳ)² – sum of squared differences from the mean (i.e., total variation in the dataset)
Result: a measure of how well the model explains the data
• "Fraction of the variation in the data explained by the model"
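A minimal Python sketch of the definition (function name illustrative):

```python
def r_squared(preds, y):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((p - yi) ** 2 for p, yi in zip(preds, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# A perfect model explains all the variation (R^2 = 1).
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```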
REGULARIZATION
Overfitting
[Three plots of Price vs. Size:]
𝜃0 + 𝜃1𝑥  (underfit)
𝜃0 + 𝜃1𝑥 + 𝜃2𝑥²  (good fit)
𝜃0 + 𝜃1𝑥 + 𝜃2𝑥² + 𝜃3𝑥³ + 𝜃4𝑥⁴  (overfit)
Intuition
𝐽′(𝜃) = 𝐽(𝜃) + Penalty
[Plots of Price vs. Size: the quartic ℎ𝜃(𝑥) = 𝜃0 + 𝜃1𝑥 + 𝜃2𝑥² + 𝜃3𝑥³ + 𝜃4𝑥⁴ overfits; ensuring the higher-order 𝜃ⱼ stay small smooths the fit]
Definitions
• Two most common methods:
  • L1 regularization ("lasso regression")
  • L2 regularization ("ridge regression", "weight decay")
𝐽_𝐿1(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ |𝜃ⱼ|
𝐽_𝐿2(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ 𝜃ⱼ²
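A sketch of the two penalized costs; the function name and the `penalty` switch are illustrative, not from the slides. The intercept 𝜃0 is left unpenalized, matching the penalty sums starting at j = 1:

```python
def regularized_cost(theta, X, y, lam, penalty="l2"):
    """J(theta) plus an L1 or L2 penalty on theta_1..theta_m."""
    n = len(y)
    sse = 0.0
    for x_i, y_i in zip(X, y):
        pred = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x_i))
        sse += (pred - y_i) ** 2
    if penalty == "l2":
        reg = lam * sum(t ** 2 for t in theta[1:])   # ridge
    else:
        reg = lam * sum(abs(t) for t in theta[1:])   # lasso
    return sse / (2 * n) + reg

# Perfect fit on y = 1 + 2x, so only the penalty remains: 0.5 * 2^2 = 2.0
print(regularized_cost([1.0, 2.0], [[0.0], [1.0]], [1.0, 3.0], lam=0.5))  # 2.0
```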
Regularized Regression
•Find the best fit
•Keep the 𝜃𝑗 terms as small as possible.
• 𝜆 is a user-set parameter which controls the trade-off
𝐽_𝐿1(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ |𝜃ⱼ|
𝐽_𝐿2(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ 𝜃ⱼ²
Regularized Regression
• The size of 𝜆 is important:
  • 𝜆 too high ⇒ no fitting
  • 𝜆 too low ⇒ no regularization
𝐽_𝐿1(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ |𝜃ⱼ|
𝐽_𝐿2(𝜃) = (1/2𝑛) ∑ᵢ₌₁ⁿ (ℎ𝜃(𝑥ᵢ) − 𝑦ᵢ)² + 𝜆 ∑ⱼ₌₁ᵐ 𝜃ⱼ²
QUESTIONS
Unsupervised Learning and K-Means Clustering
Unsupervised Learning
• Trying to find hidden structure in unlabeled data
• No error or reward signal to evaluate a potential solution. No need to pick a response class.
• Common techniques: K-Means clustering, hierarchical clustering, hidden Markov models, etc.
Choose Number of Clusters
Example 1 (domain knowledge / practicalities): Clothing sizes
• Tailor-made for each person is too expensive
• One-size-fits-all does not work!
• Group people of similar sizes together to make "small", "medium", and "large" t-shirts
Choose Number of Clusters
Example 2 (via evaluation): Patient segmentation
• Subdivide patients into distinct subsets based on disease characteristics
• Any subset may conceivably be selected as a segment and then be targeted with care models and intervention programs tailored to its needs
K-Means Clustering
• Partitions data points into similarity clusters
• Unsupervised technique: there is no partitioning into a learning or a test set in unsupervised learning
• Useful in grouping observations
• Only works for numeric data
Data Preparation
• Transform categorical variables into numeric
• Standardize
• Reduce dimensionality
Age Pclass.1 Pclass.2 Pclass.3 Sex.female Sex.male
19 0 1 0 0 1
28 1 0 0 1 0
64 0 0 1 0 1
Often called "dummy variables" or "one-hot encoding"
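A minimal pure-Python sketch of one-hot encoding like the table above; real pipelines would typically use a library helper (e.g. pandas' dummy-variable support), and the function name here is illustrative:

```python
def one_hot(values):
    """One-hot encode a categorical column into 0/1 dummy columns."""
    categories = sorted(set(values))  # fixed, alphabetical column order
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(["male", "female", "male"]))
# [[0, 1], [1, 0], [0, 1]]  (columns ordered: female, male)
```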
Euclidean Distance
Used to determine intra- and inter-cluster similarity: the distance between points (𝑥1, 𝑦1) and (𝑥2, 𝑦2) is √((𝑥2 − 𝑥1)² + (𝑦2 − 𝑦1)²)
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized
K-Means Clustering (1/2)
K-Means Clustering (2/2)
The positions of the cluster centers are determined by the mean of all the points in the cluster.
K-Means Clustering
K=3
K-Means Clustering Algorithm
Suppose a set of data points {x1, x2, …, xn}:
• Step 0: Decide the number of clusters, K
• Step 1: Place K centroids c1, c2, …, cK at random locations
• Step 2: Repeat until convergence:
    for each point xi, find the nearest centroid cj (e.g., by Euclidean distance) and assign xi to cluster j
    for each cluster j = 1..K, recompute centroid cj as the mean of all points xi assigned to cluster j in the previous step
• Step 3: Stop when none of the cluster assignments change
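The steps above (Lloyd's algorithm) can be sketched in pure Python; the function name and toy data are illustrative, not from the slides:

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Assign each point to its nearest centroid, move each centroid to the
    mean of its assigned points, and stop when the assignments no longer change."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    assignments = None
    for _ in range(max_iters):
        new_assignments = [min(range(len(centroids)),
                               key=lambda j: dist(p, centroids[j]))
                           for p in points]
        if new_assignments == assignments:  # Step 3: no change -> stop
            break
        assignments = new_assignments
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignments, centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```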
K-Means Clustering
• Minimizes aggregate intra-cluster distance
• Measures the squared distance from each point to the center of its cluster:
∑ⱼ₌₁ᴷ ∑ₓ∈𝑔ⱼ 𝐷(𝑐ⱼ, 𝑥)²
• Could converge to local minimum
• Different starting points can give very different results
• Run many times with random starting points
• Nearby points may not be assigned to the same cluster
• Strengths
  • Simple: easy to understand and to implement
  • Efficient: linear time, minimal storage
• Weaknesses
  • The mean must be well defined
  • The user needs to specify K
  • The algorithm is sensitive to outliers
K-Means Clustering
Finding K with Elbow Method
Option 1 – Percentage of variance explained as a function of the number of clusters.
Option 2 – Total of the squared distances from each cluster point to its center.
Goal – Choose a number of clusters so that adding another cluster doesn't give much better modelling of the data.
QUESTIONS
Big Data Engineering
• Introduction
• A key problem – machine learning at scale
• Distributed computing with Apache Hadoop & Hive
• Machine learning at scale with Apache Mahout
• Distributed computing v2.0 – Apache Spark
5 Vs of Big Data
• Volume – Data at rest: terabytes to exabytes of existing data to process
• Velocity – Data in motion: streaming data, milliseconds to seconds to respond
• Variety – Data in many forms: structured, unstructured, text, and multimedia
• Veracity – Data in doubt: uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, and model approximations
• Value – Data can have different value: not all bytes are created equal
▪ Goal: As data scientists we want cost-effective access to the raw materials for our data products!
MACHINE LEARNING AT SCALE
OSS R Limits
▪ Single core
▪ Single threaded
[Figure: models A, B, and C each confined to a single core of a quad-core laptop]
• Single core
• Single threaded
• All in memory (RAM)
• Vectors & matrices capped at 4,294,967,295 (2³² − 1) elements (rows) in the 32-bit version
OSS R Limits
OSS R Limits: RAM
• All in memory (RAM)
Laptop Example:
Max Data Limit = (Total RAM × 80%) − Normal RAM Usage
Max Data Limit = (5.9 GB × 80%) − 3.2 GB = ~1.52 GB
*R data frames actually bloat data files by ~3x
R Data Limit = ~1.52 GB ÷ 3 = ~506.7 MB
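The slide's arithmetic can be reproduced directly (using the slide's 1000 MB/GB rounding); variable names are illustrative:

```python
# Usable RAM minus baseline usage, then divided by R's ~3x data-frame bloat.
total_ram_gb, normal_usage_gb = 5.9, 3.2
max_data_gb = total_ram_gb * 0.80 - normal_usage_gb
r_data_limit_mb = max_data_gb / 3 * 1000

print(round(max_data_gb, 2))      # 1.52
print(round(r_data_limit_mb, 1))  # 506.7
```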
OSS R Limits: RAM
Azure’s VM with largest RAM*:
*Data collected 06/07/2017
24x7x52 Annual Cost: $116,938.44!
Max Data Limit = (2000 GB × 80%) − 1 GB = ~1600 GB
R Data Limit = ~1600 GB ÷ 3 = ~533.33 GB
Machine Learning Scaling
• Programs: Excel
• Programming: R, Python, SAS
• Cloud: Azure ML, AWS ML, Big ML, Cloud Virtual Machines
• Distributed: Hadoop, Spark, H2O, Microsoft R Server
This only gets us so far! Big data scale!
DISTRIBUTED COMPUTING WITH APACHE HADOOP
Turn Back The Clock, The Mainframe
• "Big Iron"
• Backbone of computing for decades
• Still widely used
• "Scale-up" model of shared computing
• Core platform is cost effective; the ecosystem is not (e.g., software licensing)
• The original VM host!
Distributed Computing
Cloud Computing
• Conceptually – a combination of mainframe and distributed computing.
• VM hosts are now the “Big Iron”.
• Many VMs work together to distribute workloads.
• Some workloads on dedicated HW (e.g., SAP HANA).
Scaling Computational Power
Old Scaling:
▪ Vertical scaling, scaling UP
▪ High performance computers
New Scaling:
▪ Horizontal scaling, scaling OUT
▪ Commodity hardware, distributed
What is Hadoop?
• OSS Platform for distributed computing over Internet-scale data.
• Originally built at Yahoo!
• Implementation of ideas (e.g., MapReduce) published by Google.
• The de facto standard big data platform.
• Named after a stuffed animal belonging to Doug Cutting’s son.
Hadoop at Base
A distributed batch processing engine for big data: Storage + Compute
HDFS & MapReduce
60gb of Tweets
1 Computer
Processing: 30 hours
60gb
HDFS & MapReduce
60gb of Tweets
Processing: 15 hours
30gb
2 Computers
HDFS & MapReduce
60 Gb of Tweets
Processing: 10 hours
20Gb
3 Computers
Most Cases, Linear Scaling Of Processing Power
Number of Computers Processing Time (hours)
1 30
2 15
3 10
4 7.5
5 6
6 5
7 4.29
8 3.75
9 3.33
Head Node (NameNode) and Data Nodes
If dogs were servers…
HDFS Partitioning
[Figure: a file split into Partition 1, Partition 2, and Partition 3, stored across DataNode 1, DataNode 2, and DataNode 3]
HDFS Redundancy
[Figure: blocks B1, B2, G1, G2, O1, and O2 are each replicated across multiple DataNodes, so no single node failure loses data]
MapReduce – Sandwich Analogy
Limitations with MapReduce
• A lot of code to perform even the simplest task
• Slow
• Troubleshooting multiple computers
• Good devs are scarce
• Expensive certifications
Hive
• Abstraction built on top of MapReduce & HDFS.
• Makes Hadoop look like an RDBMS (e.g., coding in SQL).
• Developed by Facebook to democratize Hadoop.
• Applies structure to data at runtime (“schema on read”).
Hive Jobs
HiveQL Statement → Translation & Conversion → MapReduce Job
Word Count Revisited
Pages of MapReduce code vs. a few lines of HiveQL:

SELECT word, COUNT(*) AS word_count
FROM words
GROUP BY word;
Caution
SELECT * FROM ANYTHING: This brings back everything. Everything
doesn’t fit on a single computer.
JOIN: Join will take hours or days to perform and eat up all cluster
bandwidth for everyone else trying to use it in the queue.
ORDER BY: Sorting is very computationally expensive.
Sub Queries: A sub query essentially creates a secondary table, which will be huge in Hive.
Interactivity: SQL in a DBMS feels interactive because it is almost instantaneous; Hive queries run as batch jobs and are not.
Hadoop Implementations
HDInsight
Hadoop in Azure
• Compute: HDInsight
• Storage (HDFS layer): Blob Storage, Azure Data Lake Store
Mahout
• Distributed Machine Learning platform.
• Built on top of MapReduce and HDFS.
• Script-based and command line interfaces.
• R-like language implementation.
Distributed Random Forest
[Figure: HDFS partitioning – Partitions 1, 2, and 3 of the data spread across Data Nodes 1, 2, and 3]
Distributed Random Forest
[Figure: a data shuffle across Data Nodes 1, 2, and 3]
Distributed Random Forest
[Figure: Nodes A, B, and C each train a decision tree (A, B, C) on their own bootstrap sample (Bags 1, 2, and 3)]
Processing Times - Machine Learning
• Data cleaning: hours
• Training: days (the bottleneck)
• Prediction: milliseconds
• Large-scale systems are only needed for training
• Phones can use models output by Mahout to predict on new data
• After a model is trained, save it to any I/O file type and reload it where you want
Distributed Computing v2.0 – Apache Spark
What is Spark?
• “A fast and general engine for large-scale data processing.”
• Designed to incorporate the goodness of Hadoop and address Hadoop’s shortcomings.
• Can complement Hadoop via integration with both HDFS and Hive.
Why Spark? Improved Performance!
▪ Up to 10x faster than Hadoop working with data from disk.*
▪ Up to 100x faster working with data stored in memory!*
* benchmark is without Apache Yarn
Daytona GraySort Contest: Sort 100 TB of data!
Previous World Record:• Method: Hadoop• Yahoo!• 72 Minutes• 2100 Nodes
2014:• Method: Spark• Databricks• 23 Minutes• 206 Nodes
3x faster on 10x fewer machines!
Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
Big Data, Faster!
Conceptual Architecture
Spark and Hadoop
[Stack diagram: HDFS and YARN at the base; MapReduce, Hive, and the Java API on the Hadoop side; Spark SQL, Spark Streaming, and MLlib on the Spark side]
▪ Spark can be deployed on a Hadoop cluster and share cluster resources via YARN.
▪ Spark, however, does not require Hadoop!
QUESTIONS
Interpreting Findings from Machine Learning
Outline
• Metrics for a machine learning problem
• Examples in health (Case Study)
• Interpretation of a metric
• Conclusion
Classification
• When the number of possible predictions is finite, it is a classification problem, e.g.:
  • Benign vs. malignant tumor (2 possible predictions)
Right Metrics for Classification Problem
Following are the go-to metrics for evaluating any classification problem, including health care applications.
• Precision, recall, specificity, F1 score
• ROC, AUC
• Accuracy
• Log-Loss
• Root Mean Squared Error (useful when classification is done for predictions)
• Sometimes one metric alone does not clarify the whole picture; therefore, multiple metrics should be considered when evaluating classifiers
• We focus on evaluation metrics for binary classification (two possible classes); the concepts extend easily to M-ary classification (M possible classes)
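The per-class metrics above all come from the four confusion-matrix counts; a minimal sketch (the function name and the sample counts are illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity, F1, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    specificity = tn / (tn + fp)     # true-negative recognition rate
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, specificity, f1, accuracy

p, r, s, f1, acc = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(round(p, 2), round(r, 2), round(s, 2), round(f1, 2), round(acc, 2))
# 0.8 0.8 0.8 0.8 0.8
```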
Interpreting Metrics for Imbalanced Datasets
• So, which model is the better one?
Cost Difference Between FP (Type I) and FN (Type II) Errors
• Imagine datasets for two applications where positive examples represent:
  • Chronic schizophrenia patients having suicidal tendencies (dataset 1)
  • Type I on the Fitzpatrick scale for human skin color (dataset 2)
• Which model is better for dataset 1, and which model is better for dataset 2?
• For both models and both datasets, Log-Loss is not a useful metric, as it is the same (0.2) for both
Model Accuracy Precision Recall F1 Score AUC Log-Loss
Model 1 0.97 1 0.83 0.91 0.85 0.2
Model 2 0.94 0.75 1 0.86 0.8 0.2
Cost of FP Vs. FN for Dataset 1
• The FP are patients who are not chronic schizophrenic but are wrongly classified as schizophrenic
• The FN are patients who are chronic but are wrongly classified as −ve
• The costs associated with FP and FN are very skewed: an FN costs far more than an FP. An FP costs a few more tests, while an FN can cost a human life
• For dataset 1, the goal should be to minimize FN; that is, the model with the higher recall should be preferred
• Since Model 2 has the higher recall, it is preferred over Model 1
FP Vs. FN for Dataset 2
• It is assumed that the dataset 2 is for the Fitzpatrick scale for human skin color. There are 6 possible predictions in {Type I, Type II, …., Type VI}
• For simplicity, let’s assume we are interested in identifying the Type I from the rest
• Fitzpatrick scale is useful for skincare (and cosmetics industry)
FP Vs. FN for Dataset 2
• The FP for this dataset are patients who do not have a Type I skin tone but are wrongly classified as Type I
• The FN are patients who do have a Type I skin tone but are wrongly classified as another type
• The costs associated with FP and FN are similar
• For such datasets, the goal should be to minimize both FN and FP
• Hence, the model with higher accuracy together with higher AUC should be given more weight
• Therefore Model 1 is the better classifier here, where the cost of FP and FN errors is symmetric
Interpreting Metrics for More +ve Examples
Same Cost of Type I and Type II Errors
Model Accuracy Precision Recall F1 Score AUC
Model 1 0.94 1 0.67 0.8 0.8
Model 2 0.94 0.75 1 0.86 0.9
Model 3 0.94 0.93 1 0.964 0.8
Model 4 0.94 1 0.92 0.958 0.9
• AUC handles more +ve and more −ve examples the same way
• The F1 scores for Models 3 and 4 are very similar, because with a large number of +ve examples the F1 score deteriorates only with misclassification of +ve examples
• For imbalanced data, F1 score and AUC are important; the F1 score becomes more important when there are fewer +ve examples, which is quite common in healthcare
Which Classifier is Better?
Classifier Accuracy Precision Recall F1 Score AUC
Model 1 0.90 0.87 0.88 0.875 0.97
Model 2 0.91 0.92 0.83 0.873 0.96
• Classifiers
• Support Vector Machines (Model 1)
• KNN (Model 2)
• As per the accuracy, Model 2 is slightly better than the Model 1
• As discussed, accuracy’s calculation does not take into account the difference in the costs associated with the FP (Type I) and FN (Type II) errors
AUC
• The AUC for both classifiers is close to 1, which is desirable: the TPR is high and the FPR is low for both
• As per the AUC, Model 1 is slightly better
• Accuracy is slightly higher for one model and AUC slightly higher for the other, so accuracy and AUC alone are not decisive in this case
• So the question still remains: which model is the better one?
Recall, Precision, F1 Score
• FP (examples predicted as malignant that are actually benign) cost the patient a few more tests and, eventually, money
• FN (examples predicted as benign that are actually malignant) can cost a human life
• In this case the health care provider's preference is obvious: choose the model with the lower FN. FN is inversely related to recall
• The higher the recall, the better: a bigger proportion of malignant patients is identified correctly.
Answer: The Higher Recall
• Usually, higher recall and higher precision are both preferred. In this case, however, it is acceptable to flag some patients as FP, conduct a few more tests, and be more certain; this results in lower precision
• Both models have similar F1 scores, and Model 2's precision is higher than Model 1's
• It is the higher recall of Model 1 that makes it the superior classifier in this example.
Balanced Dataset Example: Early stroke detection and diagnosis
• Stroke is a frequent disease that has affected 500 million people worldwide
• It is the leading cause of death in China and the 5th leading cause in the US *
• Let's consider the F1 score, AUC, and Log-Loss as the possible evaluation metrics
* Artificial intelligence in healthcare: past, present and future (https://svn.bmj.com/content/2/4/230)
Models F1 score AUC Log-Loss
Model 1 0.88 0.94 0.28
Model 2 0.97 0.98 0.6
Interpreting findings for the case of balanced dataset
• Model 2 has the better F1 score and AUC
• However, as per the Log-Loss, Model 1 is the better one
• Though the data is balanced, the cost of a Type I error differs from that of a Type II error, which gives more credence to the F1 score and AUC than to the Log-Loss
• AUC can look reasonable even for inferior models, as seen here: 0.94 for the worse model and 0.98 for the better one
• It is therefore the higher F1 score of Model 2 that matters most in this scenario.
Case Study: The Henry Ford ExercIseTesting (FIT) Project
• Data and evaluation results are obtained from the journal article [1] published on April 18, 2018
• The clinical research dataset consists of 23,095 patients, collected by the FIT project to investigate relative performance of different classification techniques for predicting the individuals at risk of developing hypertension using medical records of cardiorespiratory fitness.
• The study compares the performance of six different ML models for predicting the individuals at risk of developing hypertension using cardiorespiratory fitness data.
• Using different validation methods, the RTF model on the dataset has shown the best performance (AUC = 0.93) which outperforms the models of the previous studies.
[1] Sakr S, Elshawi R, Ahmed A, Qureshi WT, Brawner C, et al. (2018) Using machine learning on cardiorespiratory fitness data for
predicting hypertension: The Henry Ford ExercIse Testing (FIT) Project. PLOS ONE 13(4): e0195344.
https://doi.org/10.1371/journal.pone.0195344
AUC Curves for the Different Machine Learning Models using SMOTE evaluated using 10-fold cross-validation
• RTF has the best ROC curve; its AUC appears to be the maximum
The Metrics
• The RTF (Random Tree Forest) model achieves the highest AUC (0.93), F1 score (86.70%), sensitivity (69.96%), and specificity (91.71%)
• What does the higher specificity for RTF mean?
  • Remember that specificity is the true-negative recognition rate: TN/(TN+FP)
  • The higher the specificity, the lower the FP
  • Hence fewer patients are tested further
• What does the higher sensitivity/recall for RTF mean?
  • Fewer FN, and hence fewer +ve patients sneak through the ML paradigm
A Note for ML Enthusiasts
• The results show that a more complex machine learning model does not necessarily achieve better prediction accuracy; simpler models can perform better in some cases.
• The results have also shown that it is critical to carefully explore and evaluate the performance of the machine learning models using various model evaluation methods as the prediction accuracy can significantly differ.
When to Use a Particular Metric?
• For balanced class accuracy is a good metric
• AUC is a good metric for balanced data, however, it is more effective for imbalanced dataset
• If there is a dominant class (imbalanced dataset) then give more importance to AUC and F1 score
• If the goal is to classify the smaller class better, irrespective of it being a +ve or a −ve class, then AUC is a good measure
• The F1 score is important when the +ve class is small
• If the application requires the minimum FN, go for higher recall
• If the application requires the minimum FP, go for higher precision
• Higher recall/sensitivity is better for identifying the +ve examples
• Higher specificity is better for identifying the −ve examples
• Though rarely used, Log-Loss measures the absolute probabilistic difference and is important in some applications
• RMSE is useful when classification algorithms are evaluated for predictions
Conclusion
• One metric is good in one scenario and may not work for another
• Metrics should be chosen based on the balance of the data
• If +ve examples are fewer or are more important to classify then it gives credence to certain metrics over others
• In general, it is a better idea to calculate more metrics and then decide in favor of a particular model
• Generally, a good ML algorithm strikes a good balance between the Precision and Recall.
• If the difference in costs of Type I and Type II errors is large then Precision and Recall are the preferred metrics
• For health care applications the best scenario is when both recall and specificity are maximum
Implementation of Policy Recommendations: Performance-Based Financing in Health
Using Supervised Learning to Select Audit Targets in Performance-Based Financing in Health: An Example from Zambia
By Dhruv Grover, Sebastian Bauhoff, and Jed Friedman
Setting the context
• Zambia operated a pilot project of performance-based financing of public health centers from 2012 to 2014
• Public health centers are paid for the quantity and quality of the services they deliver
• Public health centers (covering 11% of Zambia's population) in 10 rural districts participated
The Program Improved Certain Indicators But…
• 42% of facilities over-reported at least once in the four quarters measured
• Financing per service incentivized both provision of the service and over-reporting
• Payment for over-reported services undermines the incentive for delivery of the service and is a waste of public resources => we need to minimize over-reporting
What Measures Were in Place to Reduce Over-reporting?
• Dedicated district steering committees conducted continuous internal verification by reconciling the facility-reported information with the paper-based evidence
• An independent third party conducted a one-off external verification process after two years of program operation (costing $22.5k)
They Need to Target the Verification So it Identifies Over-reporting Facilities But Isn’t Cost Prohibitive
• The aim of the external verification is to minimize over-reporting whilst minimizing verification costs
• You could independently verify every single facility => this would completely eliminate over-reporting BUT this is probably cost prohibitive
• You could verify no facilities at all, BUT over-reporting would likely be substantial and may worsen over time as practitioners realize they can take advantage of the lack of verification
How to Identify Facilities That are Likely to Have Over-reported?
• It is possible to identify over-reporting using random sampling or machine learning
• They predict over-reporting, defined as:
• 1 if the difference between the reported and verified data is > 10% of the reported value
• 0 otherwise
• Using the following input features:
• Reported and verified values for the 9 quantity measures rewarded in the PBF program
• Control variables
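A minimal sketch of the labeling rule above. The facility values are fabricated; only the >10% threshold comes from the slide:

```python
# Sketch: constructing the over-reporting label described above.
def over_reported(reported, verified, threshold=0.10):
    """1 if reported exceeds verified by more than 10% of the reported value, else 0."""
    return 1 if (reported - verified) > threshold * reported else 0

# Hypothetical facility records (not real PBF data)
facilities = [
    {"reported": 120, "verified": 100},  # gap of 20 > 12 (10% of 120) -> over-reported
    {"reported": 100, "verified": 95},   # gap of 5 <= 10               -> fine
    {"reported": 80,  "verified": 80},   # exact match                  -> fine
]
labels = [over_reported(f["reported"], f["verified"]) for f in facilities]
print(labels)  # [1, 0, 0]
```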
Factors to Determine Choice of ML Technique
1. What is the size of the training dataset?
2. Can features be treated as independent variables?
3. Will additional training data become available in the future and need to be incorporated into the model?
4. Is the data linearly separable?
5. Is overfitting expected to be a problem?
6. Are there any speed, performance, or memory-usage requirements?
The Algorithms Learn Patterns From the Inputs That Indicate a Facility Is at Risk of Over-reporting
• The patterns are learned on data from the first quarter only
• The models (algorithm + data + parameters) are measured on how well they correctly identify facilities that over-reported in the first quarter
• The models then apply this learning to predict the risk of over-reporting for other facilities on unseen data in subsequent quarters
• On this occasion, random forest outperformed all the other algorithms on all 5 metrics
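A sketch of this setup using scikit-learn's RandomForestClassifier on synthetic data. The 9 features and the labels below are fabricated stand-ins, not the actual Zambia PBF data, and the AUC printed is for the toy data only:

```python
# Sketch: train a random forest on synthetic "facility" data and score it with AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))   # stand-in for the 9 rewarded quantity measures
# Fabricated label: "over-reporting" driven by the first two features plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # threshold-free metric
print("test AUC:", round(auc, 3))
```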
How Does This Change the Targeting of Verification?
• The verifiers can send audit teams to verify data at those facilities that have the highest risk of over-reporting
• They save roughly $800 per clinic that did not need to be inspected
• The verifiers need to periodically collect another random sample to re-train the model, so that facilities identified as low-risk do not take advantage of the reduced monitoring and begin over-reporting
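A minimal sketch of risk-based audit targeting. The facility names and risk scores are made up (in practice they would come from the trained model's predicted probabilities); only the roughly $800-per-skipped-visit figure comes from the slide:

```python
# Sketch: send audit teams to the facilities with the highest predicted risk.
risk = {"facility_A": 0.92, "facility_B": 0.15, "facility_C": 0.64, "facility_D": 0.08}

budget = 2  # number of audits we can afford this quarter
targets = sorted(risk, key=risk.get, reverse=True)[:budget]
print(targets)  # ['facility_A', 'facility_C']

# Rough saving vs. auditing everyone, at ~$800 per skipped visit
saving = 800 * (len(risk) - budget)
print(f"estimated saving: ${saving}")  # estimated saving: $1600
```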