
Page 1: MOCK PRESENTATION SLIDES REPORT

MOCK PRESENTATION SLIDES REPORT

PLEASE NOTE: This document is for purely illustrative purposes. Minerva Statistical Consulting does not guarantee the accuracy or correctness of the content, nor does it accept any responsibility for any damages or losses incurred through its use.

Page 2: MOCK PRESENTATION SLIDES REPORT

Introduction

Can we select, among several algorithms,

the best performing machine learning

algorithm in predicting house prices?

Research Objective

Dataset used: Boston Housing Data-set.

This dataset contains information collected by the

US Census Service concerning housing in the area

of Boston (Massachusetts, USA).

Data Source:The original dataset is available (Use QR

Code) and has been used extensively

throughout the literature to benchmark

and compare the accuracy of different

algorithms.

In this project, we will use an already “cleaned

dataset” – with no missing values.

Kernel RidgeRegressor

XGBoost Regressor

Light GMB Regressor

Gradient BoostingRegressor

Light GMB Regressor is the best

performing model for this data-set.

Results

2

Page 3: MOCK PRESENTATION SLIDES REPORT

Methodology: Programming Language and Libraries Used

The programming language used throughout is Python 3.9.1.

Libraries and packages used:

NumPy

Pandas

Matplotlib

SciPy

XGBoost
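For reference, a minimal sketch of the corresponding imports (the slides only list package names; the module aliases shown are assumptions):

```python
# Minimal sketch of the imports implied by the slide; aliases are assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import xgboost as xgb
```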

Page 4: MOCK PRESENTATION SLIDES REPORT

Methodology: Data Set Overview

Hereafter, we refer to the variable MEDV as our target variable, while the rest (13 variables) are referred to as features.

CRIM: per capita crime rate by town.
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town.
CHAS: Charles River dummy variable (= 1 if tract bounds river; = 0 otherwise).
NOX: nitric oxides concentration (parts per 10 million).
RM: average number of rooms per dwelling.
AGE: proportion of owner-occupied units built prior to 1940.
DIS: weighted distances to five Boston employment centers.
RAD: index of accessibility to radial highways.
TAX: full-value property-tax rate per $10,000.
PTRATIO: pupil-teacher ratio by town.
B: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.
LSTAT: % lower status of the population.
MEDV: median value of owner-occupied homes in $1000s.

Data set information: number of missing values per variable

CRIM: 0, ZN: 0, INDUS: 0, CHAS: 0, NOX: 0, RM: 0, AGE: 0, DIS: 0, RAD: 0, TAX: 0, PTRATIO: 0, B: 0, LSTAT: 0, MEDV: 0

▪ As mentioned above, the data has already been cleaned; as such, it does not contain any missing values.

▪ In Python, we can double-check this using the isnull() method, which returns the number of missing values (see the sketch below).
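A minimal sketch of that check, assuming the cleaned data has been loaded into a pandas DataFrame (the file name and the variable name df are assumptions, not taken from the slides):

```python
# Missing-value check; file name is hypothetical.
import pandas as pd

df = pd.read_csv("boston_housing.csv")
print(df.isnull().sum())  # number of missing values per column; all zeros here
```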

Page 5: MOCK PRESENTATION SLIDES REPORT

Variable Plots (Distribution Plots)

Next, we proceed to visually explore the features by plotting the distribution of each variable below.

[Figure: distribution plots, one panel per variable — CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, MEDV.]
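The slides do not show the plotting code; a minimal sketch of one way to produce such panels with matplotlib histograms, reusing the DataFrame df assumed above:

```python
# One histogram panel per variable (14 variables, 2 x 7 grid).
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 7, figsize=(22, 6))
for ax, col in zip(axes.ravel(), df.columns):
    ax.hist(df[col], bins=30)
    ax.set_title(f"Variable = {col}")
plt.tight_layout()
plt.show()
```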

Page 6: MOCK PRESENTATION SLIDES REPORT

PRE-PROCESSING I: CORRECTING FOR SKEWNESS

After plotting, there was evidence of skewness in the data. Correcting for skewness in the data prior to model fitting is important.

▪ For instance, if the response variable is skewed to the right, the model will be trained on a much larger number of relatively inexpensively priced homes, and thus it will be less likely to successfully predict the range of the most expensive houses.

▪ Furthermore, if the values of a certain independent variable (feature) are skewed, then, depending on the model, the skewness may go against the model assumptions or may make the interpretation of feature importance difficult.

After trying various transformations (log, Box-Cox), we decided to opt for the log transformation.

NOTE: We should keep track of the transformations performed on the features, because we will need to reverse these transformations once we make predictions.
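A minimal sketch of the transform-and-reverse step; which columns get transformed is an assumption (the slides do not list them), and np.log1p / np.expm1 are used purely for illustration:

```python
# Log-transform skewed columns; remember to reverse the transform on predictions.
import numpy as np

skewed_cols = ["CRIM", "DIS", "LSTAT", "NOX", "MEDV"]  # assumed selection
df[skewed_cols] = np.log1p(df[skewed_cols])

# After fitting and predicting on the log scale:
# y_pred_dollars = np.expm1(y_pred_log)
```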

Page 7: MOCK PRESENTATION SLIDES REPORT

Variable Plots (Distribution) After Transformation

After correcting for skewness, some variables' distributions display more symmetrical shapes (see, for instance, NOX, LSTAT and DIS). Hence, the transformations help.

[Figure: post-transformation distribution plots, one panel per variable — CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, MEDV.]

Page 8: MOCK PRESENTATION SLIDES REPORT

Bimodal Variables

Some features displayed a bimodal density function; see, for instance, the features INDUS, TAX and RAD. Therefore, to take this into account, we modelled these variables as a mixture of Gaussians. This option is available in the Python scikit-learn package.
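A minimal sketch of fitting such a mixture with scikit-learn; the number of components and the choice of TAX as the example feature are assumptions:

```python
# Two-component Gaussian mixture fitted to one bimodal feature.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(df[["TAX"]])
print(gmm.means_.ravel())  # estimated component means
```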

Page 9: MOCK PRESENTATION SLIDES REPORT

CORRELATION HEAT MAP (Features vs Target)

From the correlation matrix, the features appear to be significantly correlated with the target variable (MEDV); see the last row or the last column on the right. This is important, as we would want the features to be related to the target variable.
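A minimal sketch of computing and plotting the correlation matrix with matplotlib; the original heat map may have been produced with a different library (e.g. seaborn):

```python
# Correlation matrix of all variables, drawn as a heat map.
import matplotlib.pyplot as plt

corr = df.corr()
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
plt.show()
```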

Page 10: MOCK PRESENTATION SLIDES REPORT

PRE-PROCESSING II: STANDARDISATION

▪ Next, we proceed to standardise the features. Although some algorithms are not sensitive to standardisation, it is customary to standardise the data prior to model building.

▪ We employ the built-in RobustScaler, imported from the sklearn.preprocessing module (a sketch follows below).

▪ This scaler uses statistics that are robust to outliers: it removes the median and scales the data according to a quantile range (defaulting to the IQR, the interquartile range).
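A minimal sketch of this step; the feature/target split and the variable names are assumptions:

```python
# Robust standardisation of the features (median removed, scaled by the IQR).
from sklearn.preprocessing import RobustScaler

X = df.drop(columns=["MEDV"])
y = df["MEDV"]

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```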

Page 11: MOCK PRESENTATION SLIDES REPORT

PRE-PROCESSING III: REMOVAL OF REDUNDANT FEATURES

▪ Further, we can use the built-in RFECV function (feature ranking with recursive feature elimination and cross-validated selection of the best number of features) to help us eliminate redundant features (see the sketch below).

▪ The main idea is that if some of the variables do not add new information, they are redundant and hence need to be excluded from the model.

▪ RFECV eliminates only the ZN variable (ZN is deemed to be determined by the other variables, hence redundant).
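A minimal sketch of RFECV; the base estimator (a gradient boosting regressor) and the cross-validation setting are assumptions, as the slides do not specify them:

```python
# Recursive feature elimination with cross-validated selection.
from sklearn.feature_selection import RFECV
from sklearn.ensemble import GradientBoostingRegressor

selector = RFECV(GradientBoostingRegressor(random_state=0), cv=5)
selector.fit(X_scaled, y)
print(selector.support_)  # boolean mask of retained features (ZN dropped)
```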

Page 12: MOCK PRESENTATION SLIDES REPORT

MODELLING AND PREDICTION

▪ We start by splitting the data into training and test sets (number of observations = 506).

▪ The size of the test set is 20% of the overall size of the data (0.20 × 506 ≈ 101 observations); a sketch of the split follows below.

▪ The objective is to compare the performance of the following machine learning algorithms: Kernel Ridge Regressor, XGBoost Regressor, LightGBM Regressor and Gradient Boosting Regressor, and to select the one that performs best.
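A minimal sketch of the 80/20 split; random_state is an assumption:

```python
# Hold out 20% of the observations as a test set.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.20, random_state=0
)
print(len(X_test))  # roughly 101 observations
```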

Page 13: MOCK PRESENTATION SLIDES REPORT

HOW DO WE TUNE/PARAMETRIZE EACH MODEL?

▪ Before comparing the models' performance, we need to decide how to parametrize each model. To select the best parametrization for each model, we perform a grid search across several combinations of parameters and choose the parametrization with the lowest Root Mean Square Forecast Error (RMSFE); a sketch follows below.

▪ We then select, amongst all the best-parametrized models, the best overall performing model: again, the one with the lowest RMSFE (this time across models).
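A minimal sketch of the grid-search step, shown for the LightGBM regressor; the parameter grid is purely illustrative and not the one used in the slides:

```python
# Grid search selecting the parametrization with the lowest RMSE.
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMRegressor

param_grid = {
    "num_leaves": [15, 31],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 500],
}
search = GridSearchCV(
    LGBMRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # lowest RMSE wins
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```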

Page 14: MOCK PRESENTATION SLIDES REPORT

RESULTS: BEST PERFORMANCE ACROSS MODELS - RMSFE

Model                         Training set   Validation set
Kernel Ridge Regression       0.121          0.21
XGBoost Regressor             0.075          0.171
LightGBM Regressor            0.104          0.149
Gradient Boosting Regressor   0.011          0.170

The LightGBM Regressor is the best-performing model across ALL models, since it has the lowest validation-set RMSFE.

Page 15: MOCK PRESENTATION SLIDES REPORT

L-GBM MODEL PREDICTION vs ACTUAL VALUES - SCATTER PLOT

From the scatter plot, we can see that the predicted and actual values are highly correlated, which suggests that the model performs well when compared to actual data.
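A minimal sketch of such a plot on the original dollar scale, assuming the target was log-transformed earlier (np.expm1 reverses np.log1p) and that search holds the tuned LightGBM model from the grid-search sketch:

```python
# Predicted vs actual MEDV for the held-out test set.
import numpy as np
import matplotlib.pyplot as plt

y_pred = search.best_estimator_.predict(X_test)

plt.scatter(np.expm1(y_test), np.expm1(y_pred), alpha=0.6)
plt.xlabel("Actual MEDV ($1000s)")
plt.ylabel("Predicted MEDV ($1000s)")
plt.title("LightGBM: predicted vs actual values")
plt.show()
```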

