Linear Models for Regression


Post on 20-Aug-2020


Page 1: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1

Linear Models for Regression


Agenda

• Introduction

• Gradient Descent
  • Cost function and minimization

• Implementation

•Evaluating Regression Models

•Regularization

Copyright © 2018. Data Science Dojo

INTRODUCTION


Regression

• Predict the length of a patient's stay in hospital
• Forecast the cost of treatment
• Predict the number of staff needed on a given day

Notation: Breast Cancer Dataset

$x_1^{(5)}$ identifies a single value in the dataset:
• 5: the patient is in the 5th row
• 1: the patient's diagnosis is the 1st column

Notation: Breast Cancer Dataset

So how do we describe all the rows?

Row 1: $x^{(1)} = [17.99, 10.38, 122.80]$
Row 2: $x^{(2)} = [20.57, 17.77, 132.90]$
Row 3: $x^{(3)} = [19.69, 21.25, 130.00]$

Notation: Breast Cancer Dataset

The breast cancer dataset records the physical properties of each tumor and its diagnosis.

Using this notation, we can describe all the columns of the dataset: the feature columns $x_1$, $x_2$, $x_3$ together form $X$, and the target column is $Y$.

Notation Summary

Features
• $x^{(i)}$: each row of features
• $x_j$: each column of features
• $X$: set of all the feature columns

Target
• $y^{(i)}$: each row of the target
• $Y$: the target column

• $n$: number of rows in the dataset
• $m$: number of columns in the dataset

COST FUNCTION AND GRADIENT DESCENT


What is a good regression line?

• Wind speed = 15 mph
• Ozone = ?
• Use the line that is somewhere in the middle
• How do we define "middle"?

$h_\theta(x) = \theta_0 + \theta_1 x$

Defining a line

How do we define a line in slope-intercept notation?
• $y = mx + b$

In $\theta$ notation:
• $h_\theta(x) = \theta_1 x + \theta_0$

m = slope = $\theta_1$
b = intercept = $\theta_0$

More Features

Columns: $y$ (target), $x_1$, $x_2$, $x_3$ (features)

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$

Residuals (or "Errors")

A residual is the difference between the hypothesis $h_\theta(x)$ (the predicted value) and the true value (the known target).

[Figure: regression line with two residuals labeled Error 1 and Error 2]

Cost Function

Minimize the 'cost' or 'loss' function $J(\theta)$
• Smaller for lower error
• Larger for higher error

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

[Figure: regression line with two residuals labeled Error 1 and Error 2]
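As a rough illustration of the cost function, the sketch below evaluates $J(\theta)$ for a one-feature line. The function names and the toy data are hypothetical and are not taken from the slides.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Half mean squared error: J = 1/(2n) * sum((h(x_i) - y_i)^2)."""
    n = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * n)


# Hypothetical toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect = cost(0.0, 2.0, xs, ys)  # the line passes through every point
worse = cost(0.0, 1.0, xs, ys)    # residuals are -1, -2, -3

print(perfect, worse)
```

The better-fitting line gets the smaller cost, which is the property the slide asks of $J(\theta)$.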

Cost Function

[Figure: on the left, candidate lines with $\theta_0 = 0$ and $\theta_1 = 0.5$, $\theta_1 = 1.0$, $\theta_1 = 2$; on the right, the cost plotted as a parabola]

$h_\theta(x) = \theta_0 + \theta_1 x$

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Each point on the parabola corresponds to a line on the graph on the left.

Cost function in three dimensions

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

[Figure: surface plot of $J(\theta_0, \theta_1)$ over the $\theta_0$ and $\theta_1$ axes]

How do we find the minimum of the cost function?


Maximum/Minimum Problem

Find two non-negative numbers whose sum is 9 such that the product of one number and the square of the other number is a maximum.

Solution (1/2)

The sum of the two numbers is 9:

$x + y = 9$

The product of one number and the square of the other is:

$P = x y^2 = x (9 - x)^2$

Solution (2/2)

Using the product rule and chain rule from Calculus 101:

$P' = x \cdot 2 (9 - x)(-1) + (1)(9 - x)^2$
$= (9 - x)\,[-2x + (9 - x)]$
$= (9 - x)(9 - 3x)$
$= 3 (9 - x)(3 - x)$
$= 0$

So $x = 9$ or $x = 3$. At $x = 9$ the product is $P = 0$, while at $x = 3$ (so $y = 6$) it is $P = 3 \cdot 6^2 = 108$, which is the maximum.

Maximum Problem

There are 50 apple trees in an orchard.

Each tree produces 800 apples. For each additional tree planted in the orchard, the apple output per tree drops by 10 apples.

Question: How many additional trees should be planted in the existing orchard in order to maximize the apple output of the orchard?


Solution

Let t be the number of additional trees. The total apple output is:

$A = (50 + t)(800 - 10t)$
$A = 40{,}000 + 300t - 10t^2$

Take the derivative $A'$ and set it to 0 to find the maximum:

$A' = 300 - 20t = 0$
$t = 15$

Adding 15 trees will maximize apple production.
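The calculus answer can be cross-checked by brute force. The sketch below simply tries every whole number of extra trees; the function name `apple_output` and the search range are illustrative choices, not part of the slides.

```python
def apple_output(extra_trees):
    """Total apples when `extra_trees` trees are added to the 50 existing ones."""
    trees = 50 + extra_trees
    apples_per_tree = 800 - 10 * extra_trees
    return trees * apples_per_tree


# Try every whole number of extra trees up to 80, where output per tree hits zero.
best_t = max(range(0, 81), key=apple_output)
best_output = apple_output(best_t)

print(best_t, best_output)
```

The search lands on the same t as setting the derivative to zero, which is the idea gradient descent automates when no closed-form solution is convenient.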

Gradient Descent

• Goal: minimize $J(\theta)$
• Start with some initial $\theta$ and then perform an update on each $\theta_j$ in turn:

$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^{k})$

• Repeat until $\theta$ converges
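A minimal sketch of the update rule, assuming a hypothetical one-parameter cost $J(\theta) = (\theta - 3)^2$ whose gradient is $2(\theta - 3)$. The cost, starting point, and step count are made up for illustration.

```python
def gradient_descent(grad, theta, alpha, steps):
    """Apply theta := theta - alpha * grad(theta) for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad_J(theta):
    """Gradient of the illustrative cost J(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


theta_final = gradient_descent(grad_J, theta=0.0, alpha=0.1, steps=100)
print(theta_final)  # very close to 3, the minimizer of J
```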

Gradient Descent

• $\alpha$ is known as the learning rate; it is set by the user
• At each iteration the algorithm takes a step in the direction of steepest descent, and $J(\theta)$ decreases provided $\alpha$ is small enough
• $\alpha$ determines how quickly or slowly the algorithm converges to a solution

$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^{k})$

Intuition

$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^{k})$

[Figure: $J$ plotted against $\theta_j$, with the negative-gradient and positive-gradient sides of the minimum marked and successive iterates $\theta_j^{k}$, $\theta_j^{k+1}$, ... stepping toward the minimum]

Effect of High Learning Rate: Large $\alpha$

$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^{k})$

[Figure: $J$ plotted against $\theta_j$, with the negative-gradient and positive-gradient sides of the minimum marked]

Effect of Low Learning Rate: Small $\alpha$

$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^{k})$

[Figure: $J$ plotted against $\theta_j$, with the negative-gradient and positive-gradient sides of the minimum marked]
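The effect of large and small learning rates can be seen numerically. This sketch assumes a hypothetical cost $J(\theta) = \theta^2$ (gradient $2\theta$), so each step multiplies $\theta$ by $(1 - 2\alpha)$; the three $\alpha$ values are arbitrary picks for illustration.

```python
def run(alpha, theta=1.0, steps=20):
    """Gradient descent on J(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta
    return theta


small = run(alpha=0.01)    # factor 0.98 per step: still far from 0 after 20 steps
moderate = run(alpha=0.4)  # factor 0.2 per step: converges quickly
large = run(alpha=1.1)     # factor -1.2 per step: overshoots and diverges

print(small, moderate, large)
```

With the small rate the iterate barely moves, with the large rate each step jumps across the minimum and lands farther away than it started.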

Gradient Descent Implementation

When do we stop updating?

[Figure: $J$ plotted against $\theta_j$, with two candidate stopping points marked "Here?"]

• When $\theta_j^{k+1}$ is close to $\theta_j^{k}$
• When $J(\theta^{k+1})$ is close to $J(\theta^{k})$ (the error does not change)

Batch Gradient Descent

• How do we incorporate all our data?
• Loop!

For j from 0 to m:

$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

This is the general update $\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^{k})$ with the derivative of the squared-error cost written out.

• Each $\theta_j$ represents one feature (with $x_0^{(i)} = 1$, so that $\theta_0$ is the intercept)
• $h_\theta$ is updated only once the loop has completed
• Weaknesses?

Batch Gradient Descent

• Loop!

For j from 0 to m:

$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
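A minimal sketch of batch gradient descent for linear regression, under a few assumptions that are not in the slides: each row carries a leading 1 as $x_0$ so that $\theta_0$ acts as the intercept, the loop stops when $J$ changes by less than a tolerance, and the data, learning rate, and tolerance are made up.

```python
def predict(theta, x):
    """h_theta(x), where x[0] == 1 so that theta[0] is the intercept."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2n) * sum of squared residuals."""
    n = len(X)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * n)


def batch_gradient_descent(X, y, alpha=0.1, tol=1e-12, max_iters=10000):
    n, num_params = len(X), len(X[0])
    theta = [0.0] * num_params
    prev = cost(theta, X, y)
    for _ in range(max_iters):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # Every theta_j is computed from the same old theta, averaged over all
        # n rows, and only then is the whole vector replaced.
        theta = [
            theta[j] - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / n
            for j in range(num_params)
        ]
        current = cost(theta, X, y)
        if abs(prev - current) < tol:  # stop when J no longer changes
            break
        prev = current
    return theta


# Hypothetical data on the line y = 1 + 2x; the leading 1 in each row is x_0.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = batch_gradient_descent(X, y)
print(theta)  # close to [1, 2]
```

Each update scans the whole dataset, which hints at the weakness the slide asks about: the cost of one step grows with the number of rows.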

Stochastic Gradient Descent

• Consider an alternative approach:

for i from 1 to n:
    for j from 0 to m:
        $\theta_j^{k+1} := \theta_j^{k} - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

• $h_\theta$ is updated each time the inner loop completes
• If the training set is big, it converges more quickly than batch gradient descent
• May oscillate around a minimum of $J(\theta)$ and never converge

* We are now taking only one random observation at a time as a sample, instead of averaging across observations.
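A comparable sketch of stochastic gradient descent, again assuming a leading 1 in each row for the intercept. The shuffling with a fixed seed, the learning rate, the epoch count, and the data are illustrative assumptions. Because this toy data lies exactly on a line, the oscillation mentioned on the slide does not show up here.

```python
import random


def sgd(X, y, alpha=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: update theta after every single observation."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the observations in random order
        for i in indices:
            error = sum(t * xj for t, xj in zip(theta, X[i])) - y[i]
            # All theta_j are updated from this one observation, with no averaging.
            theta = [t - alpha * error * xj for t, xj in zip(theta, X[i])]
    return theta


# Hypothetical data on the line y = 1 + 2x; the leading 1 in each row is x_0.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = sgd(X, y)
print(theta)  # close to [1, 2]
```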

Batch vs. Stochastic

Which is better to use? It depends.

Function
• Batch Gradient Descent: updates the hypothesis by scanning the whole dataset
• Stochastic Gradient Descent: updates the hypothesis by scanning one training sample at a time

Rate of convergence
• Batch Gradient Descent: slow
• Stochastic Gradient Descent: quick (but may oscillate at the minimum)

Appropriate dataset size
• Batch Gradient Descent: small
• Stochastic Gradient Descent: large

EVALUATING REGRESSION MODELS


Evaluation Metrics for Regression

• Mean Absolute Error (MAE)
• Root-Mean-Square Error (RMSE)
  • Also called Root-Mean-Square Deviation
• Coefficient of Determination ($R^2$)

Mean Absolute Error

• Mean of the absolute residual values
• "Pure" measure of error

$MAE(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left| h_\theta(x^{(i)}) - y^{(i)} \right|$

Mean Absolute Error - Example

$y = [36, 19, 34, 6, 1, 45]$
$h_\theta(x) = [27, -2.6, 13, -7.3, -2.6, 48]$
$|h_\theta(x) - y| = [9, 21.6, 21, 13.3, 3.6, 3]$

$MAE(\theta) = \frac{71.5}{6} \approx 11.9$

Root-Mean-Square Error

• Square root of the mean of the squared residuals
• Penalizes large errors more than small ones
• A good measure to use when large errors (outliers) should stand out

$RMSE(\theta) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2}$

RMSE - Example

$y = [36, 19, 34, 6, 1, 45]$
$h_\theta(x) = [27, -2.6, 13, -7.3, -2.6, 48]$
$(h_\theta(x) - y)^2 \approx [81, 467, 441, 177, 13, 9]$

$RMSE(\theta) = \sqrt{\frac{1187}{6}} \approx 14.1$
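The MAE and RMSE examples can be reproduced in a few lines. The numbers are the ones on the slides; the variable names are illustrative.

```python
import math

y_true = [36, 19, 34, 6, 1, 45]
y_pred = [27, -2.6, 13, -7.3, -2.6, 48]

residuals = [p - t for p, t in zip(y_pred, y_true)]
n = len(residuals)

mae = sum(abs(r) for r in residuals) / n               # mean absolute error
rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)   # root-mean-square error

print(round(mae, 1), round(rmse, 1))
```

RMSE comes out larger than MAE on the same residuals because squaring gives the two residuals of about 21 more weight than the small ones.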

Coefficient of Determination ($R^2$)

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

where

$SS_{res} = \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$: sum of squared residuals (i.e. total squared error)

$SS_{tot} = \sum_{i=1}^{n} \left( y^{(i)} - \bar{y} \right)^2$: sum of squared differences from the mean (i.e. total variation in the dataset)

Result: a measure of how well the model explains the data
• "Fraction of variation in data explained by model"
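As a sketch of the formula, the function below computes $R^2$ from the two sums of squares. It reuses the target and prediction values from the MAE and RMSE examples; the function name is illustrative.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [36, 19, 34, 6, 1, 45]
y_pred = [27, -2.6, 13, -7.3, -2.6, 48]

r2 = r_squared(y_true, y_pred)
print(round(r2, 2))
```

A model that predicted every target exactly would have $SS_{res} = 0$ and therefore $R^2 = 1$.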

REGULARIZATION


Overfitting

[Figure: three plots of Price against Size, each fitted with a polynomial of higher degree]
• $\theta_0 + \theta_1 x$
• $\theta_0 + \theta_1 x + \theta_2 x^2$
• $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

Intuition

$J'(\theta) = J(\theta) + \text{Penalty}$

[Figure: two plots of Price against Size, both fitted with $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$; the second plot is annotated "Ensure Small"]

Definitions

• Two most common methods
  • L1 regularization
    • lasso regression
  • L2 regularization
    • ridge regression
    • weight decay

$J_{L2}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$

$J_{L1}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$

Regularized Regression

• Find the best fit
• Keep the $\theta_j$ terms as small as possible
• $\lambda$ is a user-set parameter which controls the trade-off

$J_{L2}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$

$J_{L1}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$

Regularized Regression

• The size of $\lambda$ is important
  • $\lambda$ too high => no fitting
  • $\lambda$ too low => no regularization

$J_{L2}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$

$J_{L1}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$
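To make the role of $\lambda$ concrete, here is a small sketch for a deliberately simplified model, $h_\theta(x) = \theta x$ with a single parameter and no intercept. That simplification, the function name, and the data are assumptions for illustration. For this model the L2-regularized cost has a closed-form minimizer, so no iteration is needed.

```python
def ridge_theta(xs, ys, lam):
    """Minimizer of J_L2 for the one-parameter model h(x) = theta * x.

    Setting dJ/dtheta = (1/n) * sum((theta*x - y) * x) + 2*lam*theta to zero
    gives theta = sum(x*y) / (sum(x*x) + 2*n*lam).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + 2 * n * lam)


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

no_penalty = ridge_theta(xs, ys, lam=0.0)       # plain least squares
some_penalty = ridge_theta(xs, ys, lam=1.0)     # coefficient shrinks toward 0
heavy_penalty = ridge_theta(xs, ys, lam=100.0)  # penalty dominates: almost no fit

print(no_penalty, some_penalty, heavy_penalty)
```

With $\lambda = 0$ the fit is unregularized; as $\lambda$ grows the coefficient is pulled toward zero until the model no longer follows the data.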

QUESTIONS


Unsupervised Learning and K-Means Clustering

Unsupervised Learning

• Trying to find hidden structure in unlabeled data
• No error or reward signal to evaluate a potential solution. No need to pick a response class.
• Common techniques: K-Means clustering, hierarchical clustering, hidden Markov models, etc.

Choose Number of Clusters

Example 1 (domain knowledge / practicalities): Clothing sizes
• Tailor-made for each person is too expensive
• One-size-fits-all does not work!
• Group people of similar sizes together to make "small", "medium", and "large" t-shirts

Choose Number of Clusters

Example 2 (via evaluation): Patient segmentation
• Subdivide patients into distinct subsets based on disease characteristics
• Any subset may conceivably be selected as a segment and then targeted with care models and intervention programs tailored to its needs

K-Means Clustering

• Partitions data points into similarity clusters

• Unsupervised technique: there is no partitioning into a learning or a test set in unsupervised learning

• Useful in grouping observations

• Only works for numeric data


Data Preparation

• Transform categorical variables into numeric
• Standardize
• Reduce dimensionality

Age  Pclass.1  Pclass.2  Pclass.3  Sex.female  Sex.male
19   0         1         0         0           1
28   1         0         0         1           0
64   0         0         1         0           1

Often called "dummy variables" or "one-hot encoding"
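One way to produce such a table is sketched below in plain Python. The helper `one_hot` and the raw passenger records are hypothetical; the raw values were chosen so that the encoded rows match the table on the slide.

```python
def one_hot(rows, column, categories):
    """Replace `column` in each row dict with one 0/1 indicator per category."""
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}.{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical raw records before encoding.
passengers = [
    {"Age": 19, "Pclass": 2, "Sex": "male"},
    {"Age": 28, "Pclass": 1, "Sex": "female"},
    {"Age": 64, "Pclass": 3, "Sex": "male"},
]

encoded = one_hot(passengers, "Pclass", [1, 2, 3])
encoded = one_hot(encoded, "Sex", ["female", "male"])

for row in encoded:
    print(row)
```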

Euclidean Distance

Determine intra- and inter-cluster similarity

[Figure: two points $(x_1, y_1)$ and $(x_2, y_2)$; the Euclidean distance between them is $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$]

• Intra-cluster distances are minimized
• Inter-cluster distances are maximized

K-Means Clustering (1/2)

[Figure: panels 1 and 2]

K-Means Clustering (2/2)

[Figure: panels 3, 4, and 5]

The positions of the cluster centers are determined by the mean of all the points in the cluster.

K-Means Clustering

[Figure: example clustering with K = 3]

K-Means Clustering Algorithm

Suppose a set of data points {x1, x2, x3, ..., xn}
• Step 0: Decide the number of clusters, k
• Step 1: Place centroids c1, c2, ..., ck at random locations
• Step 2: Repeat until convergence:
    for each point xi:
        find the nearest centroid cj (e.g. by Euclidean distance)
        assign the point xi to cluster j
    for each cluster j = 1..k:
        calculate the new centroid cj = mean of all points xi assigned to cluster j in the previous step
• Step 3: Stop when none of the cluster assignments change
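The steps can be sketched in plain Python as follows. One deliberate difference from the slide: the starting centroids are passed in by the caller rather than placed at random, so that the run is reproducible. The function name, the toy points, and the starting centroids are all hypothetical.

```python
import math


def k_means(points, start_centroids, max_iters=100):
    """Cluster 2-D points, starting from the given centroids.

    Returns (assignments, centroids), where assignments[i] is the index of
    the cluster that points[i] belongs to.
    """
    centroids = list(start_centroids)
    assignments = None
    for _ in range(max_iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when none of the cluster assignments change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids


# Two well-separated hypothetical groups; both starting centroids are
# deliberately placed inside the first group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignments, centroids = k_means(points, [(0, 0), (1, 0)])
print(assignments, centroids)
```

Even from this poor start the centroids separate after two updates, because the second centroid is pulled toward the far group once it has been assigned those points.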

K-Means Clustering

• Minimizes aggregate intra-cluster distance
• Measures the squared distance from each point to the center of its cluster:

$\sum_{j=1}^{K} \sum_{x \in g_j} D(c_j, x)^2$

• Could converge to a local minimum
  • Different starting points can give very different results
  • Run many times with random starting points
• Nearby points may not be assigned to the same cluster

K-Means Clustering

• Strengths
  • Simple: easy to understand and to implement
  • Efficient: linear time, minimal storage
• Weaknesses
  • Mean must be well defined
  • The user needs to specify k
  • Algorithm is sensitive to outliers

Finding K with Elbow Method

Goal: choose a number of clusters so that adding another cluster doesn't give much better modelling of the data.

• Option 1: percentage of variance explained as a function of the number of clusters
• Option 2: total of the squared distances from each cluster point to its center
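Both options can be tabulated with a small sketch. To keep it short this uses a simplified one-dimensional k-means with a deterministic start (initial centers spread over the sorted data), which is an assumption made here for reproducibility and not the random placement described in the algorithm slide. The data values are made up and form three obvious groups.

```python
def wcss_1d(values, k, iters=50):
    """Total squared distance to cluster centers after a simple 1-D k-means."""
    values = sorted(values)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [values[(2 * j + 1) * len(values) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centers[j]) ** 2)
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sum(min((v - c) ** 2 for c in centers) for v in values)


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

# Option 2: total squared distance to the cluster centers for each k.
curve = {k: wcss_1d(data, k) for k in range(1, 6)}
# Option 1: fraction of variance explained relative to a single cluster.
explained = {k: 1 - curve[k] / curve[1] for k in curve}

for k in curve:
    print(k, round(curve[k], 2), round(explained[k], 4))
```

The total drops sharply up to k = 3 and barely changes afterwards, so the "elbow" of the curve suggests three clusters for this toy data.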

QUESTIONS


Big Data Engineering


Agenda

• Introduction
• A key problem – machine learning at scale
• Distributed computing with Apache Hadoop & Hive
• Machine learning at scale with Apache Mahout
• Distributed computing v2.0 – Apache Spark

5 Vs of Big Data

• Volume: data at rest. Terabytes to exabytes of existing data to process.
• Velocity: data in motion. Streaming data, milliseconds to seconds to respond.
• Variety: data in many forms. Structured, unstructured, text, and multimedia.
• Veracity: data in doubt. Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, and model approximations.
• Value: data can have different value. Not all bytes are created equal.

▪ Goal: As data scientists we want cost-effective access to the raw materials for our data products!

MACHINE LEARNING AT SCALE


OSS R Limits

▪ Single core

▪ Single threaded

[Figure: Model A, Model B, and Model C on a quad-core laptop]

OSS R Limits

• Single core
• Single threaded
• All in memory (RAM)
• Vectors and matrices capped at 4,294,967,295 (2^32 - 1) elements (rows) in the 32-bit version

OSS R Limits: RAM

• All in memory (RAM)

Laptop example:

Max data limit = (total RAM access × 80%) − normal RAM usage
Max data limit = (5.9 GB × 80%) − 3.2 GB
Max data limit ≈ 1.52 GB

* R data frames bloat data files by roughly 3x:
R data limit ≈ 1.52 GB ÷ 3 ≈ 506.7 MB

OSS R Limits: RAM

Azure's VM with the largest RAM*:

Max data limit = (2000 GB × 80%) − 1 GB
Max data limit ≈ 1600 GB
R data limit ≈ 1600 GB ÷ 3 ≈ 533.33 GB

24x7x52 annual cost: $116,938.44!

*Data collected 06/07/2017
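The rule of thumb from the two RAM slides can be wrapped in a small helper. The function name and parameter names are illustrative; the 80% usable fraction and the 3x data frame bloat factor are the figures quoted on the slides.

```python
def r_data_limit_gb(total_ram_gb, normal_usage_gb, usable_fraction=0.8, bloat_factor=3.0):
    """Approximate size of raw data (in GB) that fits in memory as an R data frame."""
    max_data = total_ram_gb * usable_fraction - normal_usage_gb
    return max_data / bloat_factor


laptop = r_data_limit_gb(5.9, 3.2)       # about 0.507 GB, i.e. roughly 507 MB
azure_vm = r_data_limit_gb(2000.0, 1.0)  # 533 GB; the slide rounds 1599 GB up to 1600 GB first

print(round(laptop, 3), round(azure_vm, 1))
```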

Machine Learning Scaling

• Distributed: Hadoop, Spark, H2O, Microsoft R Server
• Cloud: Azure ML, AWS ML, Big ML, cloud virtual machines
• Programming: R, Python, SAS
• Programs: Excel

[Figure annotations: "This only gets us so far!" and "Big data scale!"]

DISTRIBUTED COMPUTING WITH APACHE HADOOP


Turn Back The Clock, The Mainframe

• "Big Iron"
• Backbone of computing for decades.
• Still widely used.
• "Scale-up" model of shared computing.
• Core platform is cost effective, ecosystem is not (e.g., software licensing).
• The original VM host!

Distributed Computing


Cloud Computing

• Conceptually – a combination of mainframe and distributed computing.

• VM hosts are now the “Big Iron”.

• Many VMs work together to distribute workloads.

• Some workloads on dedicated HW (e.g., SAP HANA).


Scaling Computational Power

New scaling:
▪ Horizontal scaling, scaling OUT
▪ Commodity hardware, distributed

Old scaling:
▪ Vertical scaling, scaling UP
▪ High-performance computers

What is Hadoop?

• OSS Platform for distributed computing over Internet-scale data.

• Originally built at Yahoo!

• Implementation of ideas (e.g., MapReduce) published by Google.

• The de facto standard big data platform.

• Named after a stuffed animal belonging to Doug Cutting’s son.


Hadoop at Base

Distributed batch processing engine for big data.

• Storage
• Compute

HDFS & MapReduce

60 GB of tweets on 1 computer (60 GB per computer)

Processing: 30 hours

HDFS & MapReduce

60 GB of tweets split across 2 computers (30 GB per computer)

Processing: 15 hours

HDFS & MapReduce

60 GB of tweets split across 3 computers (20 GB per computer)

Processing: 10 hours

In Most Cases, Processing Power Scales Linearly

Number of Computers    Processing Time (hours)
1                      30
2                      15
3                      10
4                      7.5
5                      6
6                      5
7                      4.29
8                      3.75
9                      3.33
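The table follows from dividing the single-machine time by the number of computers. A minimal sketch of that ideal case, assuming the work splits perfectly evenly and there is no coordination overhead (the function name is illustrative, not part of Hadoop):

```python
def ideal_processing_hours(single_machine_hours: float, n_computers: int) -> float:
    """Ideal (linear-scaling) processing time when work is split evenly."""
    if n_computers < 1:
        raise ValueError("need at least one computer")
    return single_machine_hours / n_computers


# Reproduce the table for 1 to 9 computers, starting from 30 hours on one machine.
for n in range(1, 10):
    print(n, round(ideal_processing_hours(30, n), 2))
```

Real clusters usually fall somewhat short of this because of data transfer and scheduling overhead, which is why the slide says "most cases".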


If dogs were servers…

Head Node (NameNode)

Data Nodes


HDFS Partitioning

[Diagram: HDFS splits the data into Partition 1, Partition 2, and Partition 3, stored across DataNode 1, DataNode 2, and DataNode 3.]


HDFS Redundancy

[Diagram: blocks B1, B2, O1, O2, G1, and G2 are each stored twice, spread across DataNode 1, DataNode 2, and DataNode 3.]
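The idea behind the redundancy diagram can be sketched in a few lines. This is an illustration only: it assumes a simple round-robin placement with two copies per block, whereas real HDFS uses its own placement policy and defaults to three replicas. The function names are hypothetical.

```python
from collections import defaultdict


def place_replicas(blocks, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = defaultdict(list)
    for i, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(i + r) % len(nodes)]
            placement[node].append(block)
    return dict(placement)


def surviving_blocks(placement, failed_node):
    """Blocks still readable after one node is lost."""
    return {b for node, bs in placement.items() if node != failed_node for b in bs}


blocks = ["B1", "B2", "O1", "O2", "G1", "G2"]
nodes = ["DataNode 1", "DataNode 2", "DataNode 3"]
placement = place_replicas(blocks, nodes)

for node in nodes:
    # Losing any single node still leaves at least one copy of every block.
    print(node, "fails ->", sorted(surviving_blocks(placement, node)))
```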


MapReduce – Sandwich Analogy


Limitations with MapReduce

• A lot of code to perform even the simplest task

• Slow

• Troubleshooting multiple computers

• Good devs are scarce

• Expensive certifications


Hive

• Abstraction built on top of MapReduce & HDFS.

• Makes Hadoop look like an RDBMS (e.g., coding in SQL).

• Developed by Facebook to democratize Hadoop.

• Applies structure to data at runtime (“schema on read”).


Hive Jobs

HiveQL Statement -> Translation & Conversion -> MapReduce Job


Word Count Revisited

vs.

SELECT word, COUNT(*) AS word_count
FROM words
GROUP BY word;
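The hand-written side of this comparison is not reproduced in the transcript. As a rough, single-process illustration of the map, shuffle, and reduce steps that the HiveQL query replaces (plain Python, not Hadoop API code; all function names are illustrative):

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


print(word_count(["big data", "Big clusters"]))
```

On a real cluster each phase runs in parallel across many machines, and a job also needs input and output configuration, which is where the extra code comes from.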


Caution

SELECT * FROM ANYTHING: This brings back everything, and everything doesn’t fit on a single computer.

JOIN: A join can take hours or days to run and eat up all cluster bandwidth for everyone else waiting in the queue.

ORDER BY: Sorting is very computationally expensive.

Subqueries: A subquery essentially creates a secondary table, which will be huge in Hive.

Interactivity: SQL in a DBMS is interactive because it is almost instantaneous; Hive queries are not.


HDInsight

Hadoop Implementations


Hadoop in Azure

[Diagram: Compute: HDInsight. Storage: HDFS, Blob Storage, Azure Data Lake Store.]


Mahout

• Distributed Machine Learning platform.

• Built on top of MapReduce and HDFS.

• Script-based and command line interfaces.

• R-like language implementation.


Distributed Random Forest

[Diagram: HDFS partitioning. Partition 1, Partition 2, and Partition 3 are stored across Data Node 1, Data Node 2, and Data Node 3.]


Distributed Random Forest

[Diagram: data shuffle across Data Node 1, Data Node 2, and Data Node 3.]


Distributed Random Forest

Node A: Bag 1 -> Decision Tree A
Node B: Bag 2 -> Decision Tree B
Node C: Bag 3 -> Decision Tree C
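The one-bag-per-node idea can be sketched on a single machine. This is a simplified illustration, not Mahout code: each "node" draws a bootstrap bag and trains its own model, and predictions are combined by majority vote. A one-feature threshold rule stands in for a real decision tree, and the random feature selection of a true random forest is omitted. The data and all names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_bag(rows, rng):
    """Sample len(rows) rows with replacement (one 'bag')."""
    return [rng.choice(rows) for _ in rows]


def train_stump(bag):
    """Stand-in for a decision tree: threshold halfway between the class means."""
    pos = [x for x, y in bag if y == 1]
    neg = [x for x, y in bag if y == 0]
    if not pos or not neg:
        # Degenerate bag: always predict the only class that was seen.
        only = 1 if pos else 0
        return lambda x: only
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (pos_mean + neg_mean) / 2
    pos_is_high = pos_mean > neg_mean
    return lambda x: int((x > threshold) == pos_is_high)


def forest_predict(trees, x):
    """Majority vote over the trees trained on the different nodes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
rows = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]  # (feature, label)
trees = [train_stump(bootstrap_bag(rows, rng)) for _ in range(3)]  # nodes A, B, C
print(forest_predict(trees, 8.5))
```

Because each tree only needs its own bag, the training step parallelizes naturally across nodes; only the small trained models have to be gathered for voting.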


Processing Times - Machine Learning

Stage            Time
Data Cleaning    Hours
Training         Days (bottleneck)
Prediction       Milliseconds

• Large-scale systems are only needed for training

• Phones can use models produced by Mahout to predict on new data

• After a model is trained, save the model to a file and reload it wherever you need it
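A minimal sketch of the save-and-reload step, assuming the trained model can be represented as a plain Python object. The dictionary of parameters below is a hypothetical stand-in for a real trained model, and pickle is just one of several possible file formats.

```python
import os
import pickle
import tempfile

# Hypothetical "trained model": parameters of a small linear predictor.
model = {"intercept": 1.5, "coefficients": [0.2, -0.7]}


def predict(model, features):
    """Apply the stored parameters to one row of features."""
    weighted = sum(c * f for c, f in zip(model["coefficients"], features))
    return model["intercept"] + weighted


# Save on the training system...
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and reload wherever predictions are needed.
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(predict(reloaded, [1.0, 2.0]))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.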


Distributed Computing v2.0 – Apache Spark


What is Spark?

• “A fast and general engine for large-scale data processing.”

• Designed to incorporate the goodness of Hadoop and address Hadoop’s shortcomings.

• Can complement Hadoop via integration with both HDFS and Hive.


Why Spark? Improved Perf!

▪ Up to 10x faster than Hadoop working with data from disk.*

▪ Up to 100x faster working with data stored in memory!*

* Benchmark is without Apache YARN


Daytona GraySort Contest: Sort 100 TB of data!

Previous World Record:
• Method: Hadoop
• Yahoo!
• 72 Minutes
• 2100 Nodes

2014:
• Method: Spark
• Databricks
• 23 Minutes
• 206 Nodes

3x faster on 10x fewer machines!
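The headline figures can be checked directly from the numbers on the slide. The node-minutes comparison at the end is an extra, illustrative way of combining them and is not a figure from the source.

```python
hadoop_minutes, hadoop_nodes = 72, 2100
spark_minutes, spark_nodes = 23, 206

speedup = hadoop_minutes / spark_minutes   # roughly 3x faster
node_ratio = hadoop_nodes / spark_nodes    # roughly 10x fewer machines

# Total machine time consumed (nodes x minutes), as a rough cost proxy.
node_minutes_ratio = (hadoop_minutes * hadoop_nodes) / (spark_minutes * spark_nodes)

print(round(speedup, 1), round(node_ratio, 1), round(node_minutes_ratio, 1))
```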

Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

Big Data, Faster!


Conceptual Architecture


Spark and Hadoop

[Diagram: YARN, HDFS, MapReduce, Hive, Java API, Spark SQL, Spark Streaming, MLlib]

▪ Spark can be deployed on a Hadoop cluster and share cluster resources via YARN.

▪ Spark, however, does not require Hadoop!


QUESTIONS


Interpreting Findings from Machine Learning


Outline

• Metrics for a machine learning problem

• Examples in health (Case Study)

• Interpretation of a metric

• Conclusion


Classification

• When the number of possible predictions is finite, it is a classification problem, e.g.,
  • Benign vs. malignant tumor (2 possible predictions)


Right Metrics for Classification Problem

The following are the go-to metrics for evaluating any classification problem, including health care applications.

• Precision, recall, specificity, F1 score

• ROC, AUC

• Accuracy

• Log-Loss

• Root Mean Squared Error (useful when classification is done for predictions)

• Sometimes one metric does not give the whole picture.

• Therefore, multiple metrics should be considered for evaluating classifiers

• The evaluation metrics here are for binary classification (two possible classes); however, the concepts can easily be extended to M-ary classification (M possible classes)
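Most of the metrics listed above derive from the four cells of a binary confusion matrix. A minimal sketch using the standard definitions; the counts in the example are hypothetical and the function names are illustrative.

```python
def safe_ratio(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)          # also called sensitivity or TPR
    specificity = safe_ratio(tn, tn + fp)     # true negative rate
    accuracy = safe_ratio(tp + tn, tp + fp + tn + fn)
    f1 = safe_ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts: 36 examples, 6 of them positive, one positive missed.
metrics = classification_metrics(tp=5, fp=0, tn=30, fn=1)
for name, value in metrics.items():
    print(name, round(value, 2))
```

ROC/AUC and Log-Loss are not covered by this sketch because they need the predicted probabilities, not only the final class labels.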


Interpreting Metrics for Imbalanced Datasets

• So, which model is the better one?


Cost Difference Between FP (Type I) and FN (Type II) Errors

• Imagine the datasets for two applications where positive examples represent
  • Chronic schizophrenia patients having suicidal tendencies (dataset 1)
  • Fitzpatrick scale for human skin color (dataset 2)

• Which model is better for dataset 1 and which model is better for dataset 2?

• For both datasets, Log-Loss is not a useful metric for choosing between the models because it is the same for both

Model     Accuracy  Precision  Recall  F1 Score  AUC   Log-Loss
Model 1   0.97      1          0.83    0.91      0.85  0.2
Model 2   0.94      0.75       1       0.86      0.8   0.2


Cost of FP Vs. FN for Dataset 1

• An FP is a patient who is not actually chronic schizophrenic but is wrongly classified as schizophrenic

• An FN is a patient who actually is chronic schizophrenic but is wrongly classified as -ve

• The costs associated with FP and FN are very skewed
  • An FN costs far more than an FP: an FP costs a few more tests, while an FN can cost a human life

• For dataset 1, the goal should be to minimize FN; that is, the model with the higher Recall should be preferred

• Since Model 2 has the higher Recall, it is preferred over Model 1


FP Vs. FN for Dataset 2

• Dataset 2 is assumed to be for the Fitzpatrick scale for human skin color. There are 6 possible predictions in {Type I, Type II, …, Type VI}

• For simplicity, let’s assume we are interested in identifying the Type I from the rest

• Fitzpatrick scale is useful for skincare (and cosmetics industry)


FP Vs. FN for Dataset 2

• An FP for this dataset is a patient who does not actually have a Type I skin tone but is wrongly classified as Type I

• An FN is a patient who actually has a Type I skin tone but is wrongly classified as another type

• The costs associated with FP and FN are similar

• For such datasets the goal should be to minimize both FN and FP

• Hence, the model with the higher Accuracy together with the higher AUC should be given more weight

• Therefore, Model 1 is the better classifier when the cost of FP and FN errors is symmetric
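The two conclusions (Model 2 for dataset 1, Model 1 for dataset 2) can be reproduced by weighting each error type by its cost. The slides do not give error counts or costs, so everything numeric below is an assumption: the counts are one set of values (36 examples, 6 positive) that is consistent with the tabled accuracy, precision, and recall, and the cost weights are purely illustrative.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes under given per-error costs."""
    return fp * cost_fp + fn * cost_fn


# Assumed error counts, chosen to match the metrics table (not from the source).
models = {
    "Model 1": {"fp": 0, "fn": 1},  # precision 1, recall 5/6
    "Model 2": {"fp": 2, "fn": 0},  # precision 6/8, recall 1
}


def preferred_model(models, cost_fp, cost_fn):
    """Pick the model with the lowest total error cost."""
    return min(
        models,
        key=lambda name: total_error_cost(
            models[name]["fp"], models[name]["fn"], cost_fp, cost_fn
        ),
    )


# Dataset 1: a missed patient (FN) is assumed to cost 100x a false alarm (FP).
print("skewed costs:", preferred_model(models, cost_fp=1, cost_fn=100))
# Dataset 2: both error types are assumed to cost the same.
print("symmetric costs:", preferred_model(models, cost_fp=1, cost_fn=1))
```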


Interpreting Metrics for More +ve Examples


Same Cost of Type I and Type II Errors

Model     Accuracy  Precision  Recall  F1 Score  AUC
Model 1   0.94      1          0.67    0.8       0.8
Model 2   0.94      0.75       1       0.86      0.9
Model 3   0.94      0.93       1       0.964     0.8
Model 4   0.94      1          0.92    0.958     0.9

• AUC behaves the same way whether the dataset has more +ve or more –ve examples

• The F1 scores for Models 3 and 4 are very similar. The reason is the large number of +ve examples, together with the fact that the F1 score depends on (read: deteriorates with) the errors involving the +ve class only (FP and FN), not on the correctly classified –ve examples

• For imbalanced data, F1 score and AUC are important. However, F1 score becomes more important when there are fewer +ve examples, which is quite common in healthcare
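One way to see why F1 reacts differently from accuracy on imbalanced data: written in terms of counts, F1 = 2TP / (2TP + FP + FN), so true negatives never enter it. A small sketch with hypothetical counts (function names are illustrative):

```python
def f1_from_counts(tp, fp, fn):
    """F1 score written directly in confusion-matrix counts (no TN term)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


# Same errors on the +ve class, but very different numbers of easy -ve examples.
few_negatives = {"tp": 12, "fp": 0, "tn": 5, "fn": 1}
many_negatives = {"tp": 12, "fp": 0, "tn": 500, "fn": 1}

for counts in (few_negatives, many_negatives):
    f1 = f1_from_counts(counts["tp"], counts["fp"], counts["fn"])
    acc = accuracy(**counts)
    print(round(f1, 3), round(acc, 3))
```

Adding more correctly classified negatives raises accuracy but leaves F1 unchanged, which is why F1 is the more informative of the two when the +ve class is small.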


Which Classifier is Better?

Classifier  Accuracy  Precision  Recall  F1 Score  AUC
Model 1     0.90      0.87       0.88    0.875     0.97
Model 2     0.91      0.92       0.83    0.873     0.96

• Classifiers

• Support Vector Machines (Model 1)

• KNN (Model 2)

• As per the accuracy, Model 2 is slightly better than Model 1

• As discussed, accuracy does not take into account the difference in the costs associated with the FP (Type I) and FN (Type II) errors


AUC

• The AUC for both classifiers is close to 1, which is desirable. That is, the TPR is high and the FPR is low for both classifiers

• As per the AUC, Model 1 is slightly better

• Accuracy is slightly higher for one model and AUC is slightly higher for the other. So, accuracy and AUC are not helpful in this case

• So the question still remains to be answered: which Model is the better one?


Recall, Precision, F1 Score

• An FP (an example predicted as Malignant that is actually Benign) costs the patient a few more tests and, eventually, money

• An FN (an example predicted as Benign that is actually Malignant) can cost a human life

• In this case the health care provider’s preference is obvious: choose the model with fewer FN. For a fixed set of patients, fewer FN means higher Recall

• The higher the Recall the better; that is, a bigger proportion of Malignant patients is identified correctly.


Answer: The Higher Recall

• Usually, both higher Recall and higher Precision are preferred. However, in this case it is acceptable to flag some patients as FP, conduct a few more tests, and be more certain. This results in lower Precision

• Both models have similar F1 scores, and Model 2’s Precision is higher than Model 1’s

• However, it is the higher Recall of Model 1 that makes it the superior classifier in this example.


Balanced Dataset Example: Early stroke detection and diagnosis

• Stroke is a frequent disease that has affected 500 million people worldwide

• It is the leading cause of death in China and the 5th leading cause in the US *

• Let’s consider the F1 score, AUC and Log-Loss as the possible evaluation metrics

* Artificial intelligence in healthcare: past, present and future (https://svn.bmj.com/content/2/4/230)

Models    F1 score  AUC   Log-Loss
Model 1   0.88      0.94  0.28
Model 2   0.97      0.98  0.6


Interpreting findings for the case of balanced dataset

• Model 2 has the better
  • F1 score
  • AUC

• However, as per the Log-Loss, Model 1 is the better one

• Though the data is balanced, the cost of a Type I error is different from that of a Type II error. This gives more credence to F1 score and AUC than to Log-Loss.

• AUC can be reasonable even for inferior models, as seen in this example, where it is 0.94 for the worse model and 0.98 for the better one

• Therefore, it is the higher F1 score that makes Model 2 the more sensible choice in this scenario.
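A model can classify more examples correctly and still have a worse Log-Loss, because Log-Loss scores the predicted probabilities rather than the final labels. A minimal sketch using the standard binary log-loss formula; the labels and probabilities are made up for illustration and are not the stroke data.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; p_pred holds predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]

# Hypothetical model A: every example on the right side of 0.5, but barely.
hesitant_but_correct = [0.55, 0.45, 0.55, 0.45]
# Hypothetical model B: confident and right three times, wrong once.
confident_one_error = [0.95, 0.05, 0.95, 0.60]

print(round(log_loss(y_true, hesitant_but_correct), 2))
print(round(log_loss(y_true, confident_one_error), 2))
```

Here the model with a misclassification has the lower Log-Loss, which mirrors how the two metrics can disagree in the table above.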


Case Study: The Henry Ford ExercIse Testing (FIT) Project

• Data and evaluation results are obtained from the journal article [1] published on April 18, 2018

• The clinical research dataset consists of 23,095 patients, collected by the FIT project to investigate relative performance of different classification techniques for predicting the individuals at risk of developing hypertension using medical records of cardiorespiratory fitness.

• The study compares the performance of six different ML models for predicting the individuals at risk of developing hypertension using cardiorespiratory fitness data.

• Using different validation methods, the RTF model on the dataset has shown the best performance (AUC = 0.93) which outperforms the models of the previous studies.

[1] Sakr S, Elshawi R, Ahmed A, Qureshi WT, Brawner C, et al. (2018) Using machine learning on cardiorespiratory fitness data for predicting hypertension: The Henry Ford ExercIse Testing (FIT) Project. PLOS ONE 13(4): e0195344. https://doi.org/10.1371/journal.pone.0195344


ROC Curves for the Different Machine Learning Models Using SMOTE, Evaluated Using 10-fold Cross-validation

• RTF has the best ROC curve.

• Its AUC appears to be the maximum
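For readers unfamiliar with k-fold cross-validation, the splitting step can be sketched as follows. This is a generic illustration, not the study's code: it assumes contiguous folds without shuffling, and the function name is hypothetical. When oversampling such as SMOTE is combined with cross-validation, it is normally applied to the training part of each fold only, so that synthetic examples do not leak into the test part.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Every sample is used for testing exactly once across the 10 folds.
for fold, (train, test) in enumerate(k_fold_indices(23, k=10), start=1):
    print(fold, len(train), len(test))
```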


The Metrics

• The RTF (Random Tree Forest) model achieves the highest AUC (0.93), F1 Score (86.70%), sensitivity (69.96%) and specificity (91.71%).

• What does the higher specificity for RTF mean?
  • Remember that specificity is the true negative recognition rate: TN/(TN+FP)
  • The higher the specificity, the lower the FP
  • Hence fewer patients are tested further

• What does the higher sensitivity/recall for RTF mean?
  • Fewer FN, and hence fewer +ve patients sneak through the ML paradigm
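To make these rates concrete, they can be turned into expected error counts. The sensitivity and specificity below are the RTF figures quoted above; the cohort sizes of 1,000 are hypothetical round numbers chosen only for illustration.

```python
def expected_false_positives(n_negatives, specificity):
    """Negatives wrongly flagged: FP = negatives x (1 - specificity)."""
    return n_negatives * (1 - specificity)


def expected_false_negatives(n_positives, sensitivity):
    """Positives missed: FN = positives x (1 - sensitivity)."""
    return n_positives * (1 - sensitivity)


# Hypothetical cohort: 1,000 patients without and 1,000 with the condition.
fp = expected_false_positives(1000, specificity=0.9171)
fn = expected_false_negatives(1000, sensitivity=0.6996)
print(round(fp), "patients tested further unnecessarily")
print(round(fn), "at-risk patients missed")
```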


A Note for ML Enthusiasts

• The results show that a more complex machine learning model does not necessarily achieve better prediction accuracy. Simpler models can perform better in some cases.

• The results also show that it is critical to carefully explore and evaluate the performance of machine learning models using various model evaluation methods, as the prediction accuracy can differ significantly.


When to Use a Particular Metric?

• For balanced classes, accuracy is a good metric

• AUC is a good metric for balanced data; however, it is more effective for imbalanced datasets

• If there is a dominant class (imbalanced dataset), give more importance to AUC and F1 score

• If the goal is to classify the smaller class better, irrespective of it being a +ve or a –ve class, then AUC is a good measure

• F1 score is important when the +ve class is small

• If the application requires minimal FN, favor higher recall

• If the application requires minimal FP, favor higher precision

• Higher recall/sensitivity is better for identifying the +ve examples

• Higher specificity is better for identifying the -ve examples

• Though rarely used, Log-Loss is important when the absolute probabilistic difference matters, as it does in some applications

• RMSE is useful when classification algorithms are evaluated for predictions


Conclusion

• One metric is good in one scenario and may not work for another

• Metrics should be chosen based on the balance of the data

• If +ve examples are fewer or are more important to classify then it gives credence to certain metrics over others

• In general, it is a better idea to calculate more metrics and then decide in favor of a particular model

• Generally, a good ML algorithm strikes a good balance between the Precision and Recall.

• If the difference in costs of Type I and Type II errors is large then Precision and Recall are the preferred metrics

• For health care applications the best scenario is when both recall and specificity are maximized


Implementation of Policy Recommendations

Performance-Based Financing in Health

Using Supervised Learning to Select Audit Targets in Performance-Based Financing in Health: An Example from Zambia

By Dhruv Grover, Sebastian Bauhoff, and Jed Friedman


Setting the context

• From 2012 to 2014, Zambia operated a pilot project of performance-based financing of public health centers

• Public health centers are paid for the quantity and quality of the services they deliver

• Public health centers (covering 11% of Zambia’s population) in 10 rural districts participated


The Program Improved Certain Indicators But…

• 42% of facilities over-reported at least once in the four quarters measured

• Financing per service incentivized both provision of the service and over-reporting

• Payment for over-reported services undermines the incentive for delivery of the service and is a waste of public resources => we need to minimize over-reporting


What Measures Were in Place to Reduce Over-reporting?

• Dedicated district steering committees conducted continuous internal verification by reconciling the facility-reported information with the paper-based evidence

• An independent third party conducted a one-off external verification process after two years of program operation (costing $22.5k)


They Need to Target the Verification So it Identifies Over-reporting Facilities But Isn’t Cost Prohibitive

• The aim of the external verification is to minimize over-reporting whilst minimizing verification costs

• You could independently verify every single facility => this would completely eliminate over-reporting BUT this is probably cost prohibitive

• You could not verify any facilities BUT there is likely to be a substantial amount of over-reporting which may worsen over time as practitioners realise they can take advantage of the lack of verification


How to Identify Facilities That are Likely to Have Over-reported?

• It is possible to identify over-reporting using random sampling or machine learning

• They predict over-reporting, defined as:
  • 1 if the difference between the reported and verified data is > 10% of the reported value
  • 0 otherwise

• Using the following input features:
  • Reported and verified values for the 9 quantity measures rewarded in the PBF program
  • Control variables
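The target definition above translates directly into a labeling rule. A minimal sketch, assuming the difference is signed (reported minus verified), so that under-reporting is labeled 0; the function name and the example values are illustrative.

```python
def over_reported(reported, verified, tolerance=0.10):
    """Return 1 if reported exceeds verified by more than `tolerance` of reported."""
    return int((reported - verified) > tolerance * reported)


# Illustrative quantities for one service at three facilities.
print(over_reported(reported=100, verified=85))   # 15% gap
print(over_reported(reported=100, verified=95))   # 5% gap
print(over_reported(reported=100, verified=120))  # under-reported
```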


Factors to Determine Choice of ML Technique

1. What is the size of the training dataset?

2. Can features be treated as independent variables?

3. Will additional training data become available in the future and need to be incorporated into the model?

4. Is the data linearly separable?

5. Is overfitting expected to be a problem?

6. Are there any speed, performance, or memory usage requirements?


The Algorithms Learn the Patterns From the Inputs That Indicate That a Facility is at Risk of Over-reporting

• The patterns are learnt on data only from the first quarter

• The models (algorithm + data + parameters) are measured on how well they correctly identify facilities that over-report in the first quarter

• The models can then apply this learning to predict the risk of other facilities over-reporting, using unseen data from subsequent quarters

• On this occasion, random forest outperformed all the other algorithms on all 5 metrics


How Does This Change the Targeting of Verification?

• The verifiers can send audit teams to verify data at those facilities that have the highest risk of over-reporting

• They save c. $800 for each clinic they do not need to inspect

• The verifiers need to periodically collect another random sample to re-train the model, so that it catches facilities identified as low-risk that take advantage of the lack of monitoring by over-reporting.
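A rough sketch of how predicted risk could drive the audit list. The risk scores, facility names, and audit budget are hypothetical; the $800 figure is the per-clinic saving quoted above.

```python
def select_audit_targets(risk_by_facility, budget):
    """Return the `budget` facilities with the highest predicted risk."""
    ranked = sorted(risk_by_facility, key=risk_by_facility.get, reverse=True)
    return ranked[:budget]


def estimated_savings(n_facilities, n_audited, saving_per_clinic=800):
    """Savings from the clinics that did not need an inspection."""
    return (n_facilities - n_audited) * saving_per_clinic


# Hypothetical predicted probabilities of over-reporting from a trained model.
risks = {"Facility A": 0.9, "Facility B": 0.2, "Facility C": 0.6, "Facility D": 0.1}

targets = select_audit_targets(risks, budget=2)
print(targets)
print(estimated_savings(len(risks), len(targets)))
```

In practice a small random sample of the low-risk facilities would still be audited each period, which also provides the fresh labeled data needed for re-training.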
