Machine Learning Approaches for Survival Prediction of Critically Ill Patients Under Insulin Therapy
Bernardo Marreiros Firme
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Supervisors: Prof. João Miguel Da Costa Sousa, Eng. Aldo Robles Arévalo
Examination Committee
Chairperson: Prof. Carlos Baptista Cardeira
Supervisor: Prof. João Miguel Da Costa Sousa
Member of the Committee: Prof. Jorge Dos Santos Salvador Marques
June 2019
Acknowledgments
First of all, I thank my supervisors, Professor João Sousa and Eng. Aldo Arévalo, for all the guidance and knowledge shared throughout the development of this work.
A special thanks to my family, for all the autonomy and support given during these years, always with the most assertive advice to help me follow the right path, even with some essential deviations.
To my friends, for all the extra-school ’work’.
Last but not least, a word to Imortal BC, a club that represents a sport that helped me grow and build my character as a person.
Resumo
Esta dissertação propõe o desenvolvimento de um modelo capaz de predizer a mortalidade de pacientes sob o efeito de insulina em UCI, utilizando a base de dados MIMIC-III. A terapia com insulina é crucial para controlar os níveis de açúcar no sangue em pacientes em estado crítico. No entanto, não há consenso sobre que controlo glicémico, intensivo ou convencional, é mais benéfico de modo a reduzir a mortalidade.
Gradient boosting e regressão logística foram as técnicas escolhidas após uma extensiva comparação entre várias técnicas de machine learning. Data sampling foi aplicado para neutralizar o desequilíbrio presente no conjunto de dados e técnicas de feature selection, incluindo uma nova abordagem intitulada recursive feature selection, foram igualmente aplicadas.
No geral, gradient boosting com um total de 187 variáveis obteve o melhor desempenho (AUC de 91.4 ± 1.36) para dados coletados nas primeiras 24 horas na UCI, superando o melhor índice de gravidade, SAPS-II (AUC de 77.4 ± 2.44).
Diferentes tempos de previsão foram testados e o mais próximo da alta médica obteve o melhor desempenho (AUC de 94.8 ± 0.92).
Após feature selection, um modelo com apenas 7 variáveis obteve um bom desempenho (AUC de 90.2 ± 1.34). Este modelo foi validado usando dados da base de dados eICU-CRD, alcançando um desempenho semelhante (AUC de 88.0).
Finalizando, os modelos foram interpretados usando valores SHAP. Assim, identificaram-se as variáveis que globalmente e individualmente mais afetam os pacientes, dando origem à construção de painéis clínicos individualizados. Estes podem ser uma ferramenta importante numa perspectiva de decisões médicas auxiliadas por dados.
Palavras-chave: Machine Learning, Previsão de Mortalidade, Insulina, Gradient Boosting, Interpretação de Modelos
Abstract
This thesis proposes the development of a classification model capable of predicting mortality in patients under insulin therapy in the ICU, using data from the MIMIC-III database. Insulin therapy is crucial to control blood sugar levels in critical-care patients. However, there is no consensus on which glucose control strategy, intensive or conventional, is more beneficial in reducing mortality for these patients.
Gradient boosting and logistic regression were the modelling techniques chosen after an extensive comparison of several machine learning techniques. Data sampling was applied to counteract the imbalance present in the dataset, and feature selection techniques, including a novel approach entitled recursive feature selection, were also applied.
Overall, gradient boosting with a total of 187 features achieved the highest performance (AUC of 91.4 ± 1.36) for data collected in patients’ first 24 hours in the ICU and outperformed the best-performing severity score, SAPS-II (AUC of 77.4 ± 2.44). Different prediction time-windows were tested, and the one nearest to ICU discharge achieved the highest performance among all tested (AUC of 94.8 ± 0.92).
After feature selection, a model with only 7 features achieved a good performance (AUC of 90.2 ± 1.34). This model was validated using previously unseen data from the eICU-CRD database, and a similar performance was achieved (AUC of 88.0).
Lastly, the models were interpreted using SHAP values. Thus, the variables that most affect patients, both globally and individually, were identified, giving rise to the construction of individualized clinical dashboards. These may be an important tool for data-aided decision making by physicians.
Keywords: Machine Learning, Mortality Prediction, Insulin, Gradient Boosting, Model Interpretation
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction 1
1.1 Applications of Data Mining in Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Mortality in Patients under Insulin Therapy . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Objectives and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Insulin Therapy 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Homeostatic Regulation of Blood Glucose Levels . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Diabetes Mellitus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Types of Insulin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Insulin Therapy in Intensive Care Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Intensive Care Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Dysglycemia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Intensive Insulin Therapy vs Conventional Insulin Therapy . . . . . . . . . . . . . . . . 10
2.3 Mortality Prediction in Intensive Care Units . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Previous studies aiming mortality prediction in ICU . . . . . . . . . . . . . . . . . . . 12
2.3.2 Severity Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Data pre-processing 15
3.1 MIMIC-III Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Patients and Variables Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Inclusion Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.2 Input Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Removal of Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Feature Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 Discretization of Time-series Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.2 Glycemic Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.3 Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Selected Time-windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.1 Description of Processed Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6 Data Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6.1 Over Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6.2 Undersampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Modeling 29
4.1 Knowledge Discovery Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Modeling Techniques Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2 Linear and Quadratic Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.4 K-Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.5 Gaussian Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Ensemble learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.1 Parallel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.2 Sequential Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Gradient Boosting Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Tree Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.2 Best split method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.4 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5.1 Recursive Feature Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5.2 Sequential Forward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5.3 Recursive Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.6 Model Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6.1 Features Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6.2 SHapley Additive exPlanation values . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7 Model Performance Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7.1 Repeated K-Fold Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7.2 Sensitivity, Specificity and Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7.3 Area under the receiver operating characteristic curve . . . . . . . . . . . . . . . . . . 44
4.7.4 Area under the Precision-Recall curve . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Results 45
5.1 Descriptive Analysis of the Cohort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Selection of a Machine Learning Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Selection of a Gradient Boosting Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 First-day Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4.1 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4.2 Effects of Sampling Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.4.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4.4 Feature Selection - Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.5 Comparing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.5 Comparison with severity scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.6 Analysis for different time-windows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.7 Comparison with similar studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.8 External Validation - eICU-CRD database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6 Model Analysis and Interpretation 63
6.1 Model Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.1.1 Glasgow Coma Scale, Age and Ventilation Time . . . . . . . . . . . . . . . . . . . . 65
6.1.2 Number of Insulin Infusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.3 Diabetes and Long-Term Insulin Users . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.4 Ethnicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.5 Glucose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.6 Respiratory Rate and Respiratory Diseases . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.7 Anion Gap and Bicarbonate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2 Individualized Clinical Dashboards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7 Conclusions 77
7.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.2 Comparison with Previous Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Bibliography 81
A Outlier Detection 90
B Gradient Boosting Machines 94
C eICU-CRD Collaborative Research Database 97
C.1 Inclusion Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
C.2 Input Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
C.3 Data Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
D SHAP values 100
List of Tables
2.1 Types of Insulin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Blood glucose levels classification (mg/dL) . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Chronological summary of major randomized controlled trials. . . . . . . . . . . . . . . . . . 11
3.1 Demographic variables for the working cohort. . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Multi-level approach for patients’ diagnoses . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Covariates associated to patients’ diagnoses . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Laboratorial variables and normal ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Vital variables and normal ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.6 Number of patients included in each time-window. . . . . . . . . . . . . . . . . . . . . . . . 25
3.7 Description of the input variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Machine learning algorithms assessed in this thesis work. . . . . . . . . . . . . . . . . . . . . 32
4.2 Hyperparameters to tune. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1 Performance metrics for machine learning techniques . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Performance metrics for each GB algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Performance metrics for first-day dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.4 Performance metrics for oversampling techniques . . . . . . . . . . . . . . . . . . . . . . . . 51
5.5 Performance metrics for undersampling techniques . . . . . . . . . . . . . . . . . . . . . . . . 51
5.6 Input features after feature selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.7 Performance metrics after feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.8 Performance metrics to comparison between approaches for LGB . . . . . . . . . . . . . . . . 59
5.9 Performance metrics to compare with common severity scores used in clinical setting. . . . . . 59
5.10 Performance metrics for different data extraction time-windows. . . . . . . . . . . . . . . . . 60
5.11 Performance comparison with literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.12 Performance metrics for external validation with eICU database . . . . . . . . . . . . . . . . . 62
C.1 Number of patients included in each feature subset . . . . . . . . . . . . . . . . . . . . . . . 99
List of Figures
1.1 Internet of Things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Internet of Things - Analogy Human-Car . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Data Mining in databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Homeostasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 Inclusion criteria applied to extract the cohort used in this work. . . . . . . . . . . . . . . . . 16
3.2 Patients percentage per variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Outliers Detection - Glucose Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Percentage of patients lost and its relationship with missing data in some features. . . . . 23
3.5 Prediction time-windows selected for the study. . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Model construction layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Trade-off Select Machine Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Timeline of ensemble learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 General overview of the working dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Comparison of different machine learning algorithms (first-day) . . . . . . . . . . . . . . . . 47
5.3 Comparison of different machine learning algorithms (Last-day) . . . . . . . . . . . . . . . . 47
5.4 Performance analysis comparing GB algorithms (First day) . . . . . . . . . . . . . . . . . . . 48
5.5 Performance analysis comparing GB algorithms (Last day) . . . . . . . . . . . . . . . . . . . 48
5.6 Number of Estimators vs Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.7 Number of leaves vs Max depth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.8 Subsampling and feature sampling for LGB. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.9 Recursive Feature Elimination - LGB with SHAP Values . . . . . . . . . . . . . . . . . . . . . 53
5.10 Sequential Forward Selection - LGB with AUC metric . . . . . . . . . . . . . . . . . . . . . . 53
5.11 Recursive Feature Selection - LGB with SHAP Values . . . . . . . . . . . . . . . . . . . . . . 54
5.12 Recursive Feature Elimination - LR with Weight Vectors . . . . . . . . . . . . . . . . . . . . 55
5.13 Sequential Forward Selection - LR with AUC metric . . . . . . . . . . . . . . . . . . . . . . . 56
5.14 Recursive Feature Selection - LR with Weight Vectors . . . . . . . . . . . . . . . . . . . . 57
5.15 Analysis for different time-windows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1 Feature importance ranked through SHAP values. . . . . . . . . . . . . . . . . . . . . . . 64
6.2 SHAP values for the 20 most important features. . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 SHAP ranking and the relationship to different covariates. . . . . . . . . . . . . . . . . . . . 66
6.4 Number of infusions and SHAP values for LightGBM and LDA models. . . . . . . . . . . . . 67
6.5 Diabetic condition and SHAP values for LGB and LR models. . . . . . . . . . . . . . . . . . 68
6.6 Ethnicity SHAP values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.7 Glucose readings vs SHAP values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.8 Respiratory Rate SHAP values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.9 Respiratory Rate (Respiratory Diseases) SHAP values . . . . . . . . . . . . . . . . . . . . . . 71
6.10 Anion gap and SHAP values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.11 Bicarbonate Mean SHAP values for LGB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.12 Patients mortality probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.13 Clinical dashboard for a patient expected to survive . . . . . . . . . . . . . . . . . . . . . . . 74
6.14 Clinical dashboard for a patient expected to die . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.15 Patients mortality probability coloured by real outcomes . . . . . . . . . . . . . . . . . . . . . 75
A.1 Outliers Detection - Anion Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.2 Outliers Detection - Bicarbonate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.3 Outliers Detection - Chloride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.4 Outliers Detection - Creatinine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.5 Outliers Detection - Hemoglobin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.6 Outliers Detection - Hematocrit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.7 Outliers Detection - MCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.8 Outliers Detection - MCHC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.9 Outliers Detection - MCV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.10 Outliers Detection - Platelet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.11 Outliers Detection - RBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.12 Outliers Detection - RDW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.13 Outliers Detection - Sodium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.14 Outliers Detection - BUN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.15 Outliers Detection - Glucose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
C.1 Databases analogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
C.2 Inclusion criteria applied to extract the cohort used in this work. . . . . . . . . . . . . . . . . 98
C.3 Missing values for each variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Nomenclature
Acronyms
ADA AdaBoost
AUC Area Under ROC curve
AUPRC Area Under Precision-Recall curve
CB Categorical Boosting (CatBoost)
CCS Clinical Classifications Software
CIT Conventional Insulin Therapy
CRISP-DM Cross-Industry Standard Process for Data Mining
DIKW Data-Information-Knowledge-Wisdom
DT Decision Trees
ENN Edited Nearest Neighbours
GB Gradient Boosting
GNB Gaussian Naive Bayes
ICD-9-CM International Classification of Diseases, Ninth Revision, Clinical Modification
ICU Intensive Care Unit
IIT Intensive Insulin Therapy
IoT Internet of Things
KDD Knowledge Discovery in Databases
KNN K-Nearest Neighbour
LDA Linear Discriminant Analysis
LGB Light Gradient Boosting (LightGBM)
LOCF Last Observation Carried Forward
LODS Logistic Organ Dysfunction System
MAR Missing at Random
MCAR Missing Completely at Random
MCHC Mean Corpuscular Hemoglobin Concentration
MCH Mean Corpuscular Hemoglobin
MCV Mean Corpuscular Volume
MIMIC III Medical Information Mart for Intensive Care III
MNAR Missing Not at Random
NCR Neighborhood Cleaning Rule
OASIS Oxford Acute Severity of Illness Score
QDA Quadratic Discriminant Analysis
QR Quick Response
QSOFA Quick Sequential Organ Failure Assessment
RBC Red Blood Cells
RDW Red cell Distribution Width
RFID Radio-Frequency Identification
RFS Recursive Feature Selection
RF Random Forest
SAPS III Simplified Acute Physiology Score 3
SAPS Simplified Acute Physiology Score
SHAP SHapley Additive exPlanations
SOFA Sequential Organ Failure Assessment
SVM Support Vector Machines
WBC White Blood Cells
XGB Extreme Gradient Boosting (XGBoost)
Chapter 1
Introduction
The world’s digitization has brought a multitude of ways to generate and collect data in the most diverse contexts. Data is scattered everywhere. Everything is data. However, data alone is inherently powerless, and that is where machines and current machine learning techniques make a difference. The ability to learn from data has changed the way machines are programmed, and companies have realized that investing in transforming data into information can support decision-making processes.
Surely everyone has already encountered machine learning in everyday situations: weather forecasting in the news, e-mail spam filters for unwanted information, alerts when logging in to our personal accounts on a new device, or online retailers offering personalized recommendations based on previous purchases and activity. Daily used apps are loaded with machine learning to give the consumer solid and empowered support. Apps use image recognition to identify familiar faces from a contact list, suggest movies and music tracks from likes and dislikes, use voice recognition to imitate human interaction, and analyze traffic to reduce travel time by suggesting faster routes. Even smart homes nowadays adjust the indoor climate, switch and regulate lights, or detect a leak in a water pipe purely by making data-driven decisions [1, 2].
The importance of data has grown so much in the past decade that it is nowadays considered one of the most important commodities and is giving rise to a new economy comparable to that of the petroleum industry. Data is treated as a new currency and is often described as the petroleum of the digital era [3].
But how and why is data changing this paradigm of society?
Technological advancements are constantly changing how society interacts. The Internet has been the catalyst for people to leverage connectivity to interact with each other and with multiple devices. The world is at our fingertips, and with just a click data is generated.
The way data is handled is what makes it so valuable, and machine learning makes computing processes more efficient, cost-effective, reliable, and personalized for the consumer. Nevertheless, machine learning only works when data is shared, which turns data into a valuable asset from the economic perspective of data sharing.
From computing devices to mechanical and digital machines, passing through animals and people, at some point everything will be connected and able to identify itself in a data sharing environment, the so-called Internet of Things (IoT) (figure 1.1), tailored to transfer data from one node to another to optimize each one’s performance.
This assumption does not necessarily mean the communication method is strictly restricted to the Internet, since radio-frequency identification (RFID), sensor technologies, or QR codes may be included to facilitate the data flow in and between ”things”. The IoT is expected to produce a tremendous amount of data that can be leveraged with the help of machine learning. As more data becomes available, more diverse and challenging problems can be tackled [4–6].
Figure 1.1: Internet of Things
With the help of the IoT and machine learning, it is possible to bridge the gap between medicine and mechanical engineering to better understand the aim of this thesis. To that end, an analogy between a car and a human is made, keeping in mind the proper separation of ideas (figure 1.2).
Considering the car’s engine as its heart and its fluids as blood, either a device monitoring the heart signal of a person or a sensor measuring engine performance in a car provides data about the object under study, from which it is possible to infer possible malfunctions.
In a practical example related to this thesis, the monitored control of blood glucose levels leads to a decision by the physician in the presence of abnormal values (occasionally insulin is administered in the presence of high values), in the same way that a sensor recording abnormal oil levels results in an action on the part of the mechanic. The expected outcome is that both levels return to their normal range of values, but external constraints, such as a patient’s resistance to insulin or a leak in the oil reservoir, may change the desired outcome.
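This monitor-then-decide loop can be sketched as a toy decision rule. The thresholds below are illustrative placeholders, not clinical guidance and not the protocol studied in this thesis:

```python
def check_glucose(reading_mg_dl: float) -> str:
    """Classify a blood glucose reading and suggest a next step.

    The 70/180 mg/dL thresholds are illustrative placeholders only.
    """
    if reading_mg_dl < 70:
        return "hypoglycemia: alert the physician"
    if reading_mg_dl > 180:
        return "hyperglycemia: physician may administer insulin"
    return "within target range: continue monitoring"

print(check_glucose(250.0))  # hyperglycemia: physician may administer insulin
```

The same shape of rule would apply to the mechanic's oil-level sensor: an out-of-range reading triggers an action, an in-range one keeps the loop passive.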
Thus, it is important to take more data into account to infer the possible cause, i.e., to collect more clinical or mechanical information about the object under study, which is possible by accessing sensors and monitoring devices embedded in the object. These sensors are connected to a gateway that analyzes and sorts the data, transmitting only valuable information to the IoT platform.
The IoT platform constantly gathers and stores data not only from the object under study but also from other similar objects in order to build a historical record and a database. Coupling machine learning with the data collected in the IoT platform leads to the identification of possible diseases or anomalies, suggesting a treatment or repair to the physician or mechanic and supporting the decision-making process.
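The gateway's filtering role described above can be sketched as follows. The sensor names and normal ranges are illustrative assumptions, not values from this thesis:

```python
from typing import Iterable, List, Tuple

# Illustrative normal ranges per sensor; a real gateway would use
# device- and patient-specific limits.
NORMAL_RANGES = {"glucose": (70.0, 180.0), "heart_rate": (60.0, 100.0)}

def gateway_filter(readings: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Forward only out-of-range readings to the IoT platform."""
    forwarded = []
    for sensor, value in readings:
        low, high = NORMAL_RANGES[sensor]
        if not low <= value <= high:
            forwarded.append((sensor, value))
    return forwarded

stream = [("glucose", 95.0), ("glucose", 240.0), ("heart_rate", 130.0)]
print(gateway_filter(stream))  # [('glucose', 240.0), ('heart_rate', 130.0)]
```

The design choice here is to push a cheap range check to the edge so that only the informative minority of readings consumes bandwidth and storage on the platform.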
Estimating the lifetime of each object from data is a more ambitious task, which will be addressed in the case of patients in intensive care units over the course of this master’s thesis.
(a) Analogy Human-Car (b) ’Things’ in Internet of Things
Figure 1.2: Internet of Things - Analogy Human-Car
After all, it all comes down to handling the data, no matter the topic, and climbing the data-information-knowledge-wisdom (DIKW) hierarchy [7], where data leads to information, which is transformed into knowledge by identifying patterns, but only with a formulation and an understanding of the principles can it be turned into wisdom. In a succinct way, the process of discovering useful information from data is called data mining.
1.1 Applications of Data Mining in Healthcare
Data mining is becoming increasingly essential in healthcare. Insurance companies attempting fraud detection, diagnoses through the analysis of images, and resource allocation control in hospitals based on predicted high-risk areas are examples of data mining applications in healthcare [8, 9]. Notwithstanding, data mining is crucial for data from intensive care units (ICUs), both because of its high dimensionality and because of the possibility of achieving major outcomes in critically ill patients [10].
Due to the critical condition of the patients, ICUs are the most data-intensive part of any hospital. Highly sophisticated devices have been implemented to closely monitor patients’ health status and are capable of collecting electronic health records (EHR) at an extremely high frequency [11]. EHRs cover not only bedside-monitored vital signs or laboratory measures collected from devices but also patients’ demographic information, past medical history, and diagnosis notes, just to name a few, in order to provide a detailed clinical record for each patient. Having the ICU data in an electronic format has led to the development of large datasets whose latent information has the potential to improve health and disease understanding [10, 11].
However, there is no consensus on a framework for applying data mining to datasets. The Knowledge Discovery in
Databases (KDD) process [12] and the Cross Industry Standard Process for Data Mining (CRISP-DM) [13] are widely used
frameworks in a variety of different projects, and their incorporation for data mining applications in healthcare
is viable. Despite minor differences, these frameworks are quite similar and can be merged into a single
framework in order to meet the demanding necessities of this thesis.
Thus, the problem can be branched into seven major steps, described in figure 1.3. Predominantly
following the steps proposed in [13], an extra step for model interpretation was introduced owing to the
requirement to understand model outputs, which is essential to reach possible medical conclusions.
Project understanding, data understanding, data preparation, modelling using machine learning, evaluation,
model interpretation and deployment are the steps that will be discussed throughout the constituent chapters
of this thesis.
Figure 1.3: Data Mining in databases
1.2 Mortality in Patients under Insulin Therapy
Over the past years, glucose control among patients in ICUs has become a widely discussed and controversial
topic. Since a study conducted by Van den Berghe [14] reported that critically ill patients under intensive insulin
therapy (IIT), aiming to achieve normoglycemia, had lower mortality rates than those following the conventional
protocol, several studies have been conducted comparing intensive and conventional insulin
therapy (section 2.2.3). After years as the standard of care in the ICU, intensive insulin therapy came to be seen as
counterproductive, and conventional insulin therapy (CIT) established itself as the standard practice [15].
Insulin therapy is intrinsically related to hyperglycemic events, since it is the preferred regimen for effectively
treating hyperglycemia in the ICU [16]. Hyperglycemia occurs in the majority of critically ill patients regardless
of previous history of diabetes and is associated with many adverse clinical outcomes [15, 17, 18], including
mortality and morbidity [19, 20]. The mortality rate for newly hyperglycemic patients approaches one in three
[21], so it is consensual that hyperglycemia should be treated to improve the chances of survival.
On the other hand, hypoglycemia is a limiting factor when dealing with the maintenance of blood glucose
levels [22]. Although not directly related, hypoglycemic events can be linked with insulin infusion. Absolute
or relative insulin excess, together with inadequate nutrition and features of critical illness, are the fundamental
causes of hypoglycemia in ICUs [22, 23]. Like hyperglycemia, hypoglycemia is a concern during the
management of critically ill patients and is associated with unfavourable outcomes for the patients.
1.3 Objectives and Contributions
The aim of this thesis is to develop a model capable of predicting mortality in patients under insulin therapy in
the ICU, using data from the Medical Information Mart for Intensive Care (MIMIC) III database.
After the model's construction, it will be validated with data from the eICU Collaborative Research
Database from Philips Healthcare; to the best of our knowledge, this will be one of the first works to
present results using that database.
The work is mainly focused on patients' first 24 hours in the ICU, since this is the time window used to calculate
common severity scores in ICUs. However, miscellaneous data are also taken from different time windows during
the patient's ICU stay, in the perspective of real-time mortality prediction.
To the best of our knowledge, this is the first work to focus entirely on patients under insulin therapy.
Although there are several studies comparing the influence of different insulin therapies in ICUs (identified
in section 2.2.3), none of them is intended to predict mortality. On the other hand, studies aiming
to predict mortality, even if not insulin-related, served for performance assessment. Compared with the highest
performance found in the literature [24], a study with no restrictions on patient choice, the model has a similar
performance despite dealing with a less predictable cohort of patients.
Regarding data preprocessing, this work is one of the few to rearrange diagnoses upon admission, more
precisely ICD9 codes, in a multi-level approach following the rules of the Clinical Classifications Software (CCS)
[25].
In the modeling strand, several machine learning techniques were tested, namely: k-nearest neighbours (KNN);
support vector machines (SVM); decision trees (DT); random forest (RF); logistic regression (LR); AdaBoost
(ADA); gradient boosting (GB); Gaussian naive Bayes (GNB); linear discriminant analysis (LDA) and quadratic
discriminant analysis (QDA).
LR achieved the second highest performance among all techniques and is, as a norm, the baseline model
in health data analysis. For that reason, it was kept for the remainder of the work.
GB achieved the highest performance among all techniques, and three different gradient boosting frameworks
were subsequently tested: XGBoost (XGB), LightGBM (LGB) and CatBoost (CB). All three
achieved a higher performance than GB, but similar performance among each other. Among those, LGB was
selected for further work due to its outstanding computational performance.
Regarding feature selection, a novel approach entitled recursive feature selection (RFS) was implemented,
using the importance of features calculated through SHapley Additive exPlanations (SHAP) values as the
ranking criterion for recursive feature elimination, coupled with principles from sequential selection methods. For
comparison, the widely used sequential forward selection and recursive feature elimination were also performed.
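As an illustration of the recursive elimination loop at the heart of RFS, the following is a minimal sketch in Python. It substitutes a model's built-in `feature_importances_` for SHAP values (the actual work uses SHAP as the ranking criterion), and the helper name, data and parameters are assumptions for illustration, not the thesis implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def recursive_feature_selection(X, y, min_features=5):
    """Recursively drop the least important feature, keeping the
    best-scoring subset seen so far (importances stand in for SHAP)."""
    features = list(range(X.shape[1]))
    best_score, best_subset = -np.inf, features[:]
    while len(features) > min_features:
        model = GradientBoostingClassifier(n_estimators=50, random_state=0)
        # score the current subset with cross-validated AUC
        score = cross_val_score(model, X[:, features], y,
                                cv=3, scoring="roc_auc").mean()
        if score > best_score:
            best_score, best_subset = score, features[:]
        # rank features and eliminate the least important one
        model.fit(X[:, features], y)
        worst = int(np.argmin(model.feature_importances_))
        features.pop(worst)
    return best_subset, best_score

# Synthetic data only, to exercise the loop
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)
subset, auc = recursive_feature_selection(X, y)
print(len(subset), round(auc, 3))
```

The loop differs from plain recursive feature elimination in that it keeps the best-scoring subset seen across all sizes rather than a fixed target size.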
Moreover, this work underpins these real contributions not only in terms of predictive performance but especially
in the way the models are interpreted using SHAP values. Thus, the variables that, overall and individually, most
affect patients are identified, giving rise to the construction of individualized clinical dashboards. These may
be an important tool for data-aided decisions by physicians.
To highlight that, the work developed was the basis of the conference paper entitled 'Fuzzy Modeling of
Survival Associated to Insulin Therapy in the ICU', submitted to FUZZ-IEEE 2019 - International Conference
on Fuzzy Systems.
1.4 Thesis Outline
This thesis is organized into 7 chapters. Following the present chapter, the 6 subsequent chapters are described
by topic as follows:
Chapter 2 - Insulin Therapy in Intensive Care Units
Theoretical framework on the influence of insulin on the human body and its importance for controlling blood
glucose levels in intensive care units. Survey of studies comparing intensive and conventional insulin therapy.
Previous studies aiming at mortality prediction. Severity scores in the ICU.
Chapter 3 - Data Pre-Processing
MIMIC III database description. Patients and variables inclusion criteria. Variables description by type and
typical range of values. Missing data, outliers and max-min normalization. Time experiments. Data sampling.
Chapter 4 - Modelling
Problem assessment. Explanation of the machine learning techniques. Explanation of model interpretation.
Feature selection techniques. Performance metrics.
Chapter 5 - Results
Overview of the data for the study. Choice of machine learning techniques. Hyperparameter tuning. Results
after data sampling and feature selection. Mortality prediction for different time windows. External validation
using the eICU-CRD database.
Chapter 6 - Model Analysis and Interpretation
Interpretation of the model. Individualized clinical dashboards.
Chapter 7 - Conclusion
Main conclusions achieved with this master’s thesis. Future work.
Chapter 2
Insulin Therapy
This chapter describes the role of insulin in human metabolism. Section 2.1 succinctly explains key concepts
such as homeostasis, how homeostatic regulation of blood glucose levels occur, which metabolic disorders are
associated with glucose levels, and which types of insulin are used to counteract these disorders. Section 2.2
describes insulin therapy in the ICU setting through a brief review of the studies performed to compare
different types of insulin therapy for glucose control. Lastly, section 2.3 presents a brief state of the art of
the current studies in mortality prediction and the severity scores used to evaluate patient health status in the
ICU.
2.1 Introduction
2.1.1 Homeostatic Regulation of Blood Glucose Levels
The human body is composed of trillions of cells that work together for the maintenance of the entire organism.
Cells make up tissues, tissues are grouped into organs, and separate organs work together, forming the human body
systems with distinct functions. Maintaining equilibrium and stability in response to environmental changes
is crucial for the welfare of an individual.
Homeostasis is the tendency of the human body systems to maintain internal stability, owing to a coordinated
response to any stimulus that would tend to disturb the normal functionality of the human body [28]. Figure 2.1
summarizes the human body systems and describes their role in homeostasis.
The response to a stimulus is characterized by feedback regulation. In the first instance, a stimulus is
detected by a receptor that sends information to a control center, which in turn sends commands to an effector
to counteract the stimulus. The response to the stimulus may itself become a new stimulus, and the process is repeated
until a set point is reached, resulting in homeostasis.
Regulating blood glucose concentration is part of homeostatic regulation [29]. Glucose is required for
cellular respiration and is the preferred fuel for all human body cells. An imbalance in normal blood glucose
levels works as a stimulus that is detected by the pancreas. The pancreatic islets, namely the islets of Langerhans,
are dispersed throughout the pancreas and release regulatory hormones from different cell types to counteract
the stimulus. When blood glucose levels rise, insulin is released from the β-cells into the bloodstream, stimulating
body cells to take up glucose from the blood and the liver to convert excess glucose into glycogen (glycogenesis).
Conversely, when blood glucose levels fall, the α-cells release glucagon into the bloodstream, stimulating the liver to
break down stored glycogen and release glucose into the blood (glycogenolysis). A schematic picture of the
homeostatic regulation of blood glucose levels is presented in figure 2.1.

Figure 2.1: Homeostasis. (a) Human body systems and their role in homeostasis (source: [26]). (b) Homeostatic regulation of blood glucose levels (source: [27]).
2.1.2 Diabetes Mellitus
Diabetes mellitus is a metabolic disorder that causes high blood glucose levels over a prolonged period, due to
defects in insulin production or to cells' resistance to insulin. Individuals who carry this condition have problems
with homeostatic regulation when blood glucose levels rise and need special care or treatment with insulin. There
are two major types of diabetes mellitus: type I and type II.
In type I diabetes there is an absence of β-cells in the pancreas, mistakenly killed by the immune system,
so there is no insulin production. Glucose is not removed from the bloodstream, and an insulin-dependent treatment
is necessary to avoid death.
Type II diabetes is characterized by cells' insensitivity or resistance to insulin. Due to exposure to high
blood glucose levels for an extended period and an overproduction of insulin, the human body adapts and
becomes ineffective at using insulin or simply unable to produce enough insulin. However, this is a non-insulin-
dependent condition.
Secondary diabetes is a consequence of another medical condition and is taken as a response to the effects of
hyperglycemia. It is important to mention that diabetes can appear in a wide range of types, for instance
gestational diabetes, diabetes LADA, diabetes MODY, double diabetes, type III diabetes or diabetes insipidus
[30]. However, these types of diabetes are outside the main purpose of this work.
2.1.3 Types of Insulin
In a historical context, insulin was discovered in 1921 by Banting and Best in pancreatic extracts of dogs and,
with the help of MacLeod, it was possible to purify insulin for human use. Banting and MacLeod were
awarded the Nobel Prize in Physiology "for the discovery of insulin" back in 1923 [31].
Initially extracted from animal pancreases, human insulin is now produced synthetically by growing genetically
engineered strains of bacteria, namely E. coli. However, insulin analogs are replacing synthetic human
insulin, as they better mimic the body's natural pattern of insulin release and have a more predictable
duration of action [32].
The types of insulin are described in table 2.1 by their duration of action, which differs in terms of onset, peak
and duration in the bloodstream. Their role in blood sugar management is also explained.
Table 2.1: Types of Insulin [32].

| Type of Insulin (Brand Name) | Onset | Peak | Duration | Role in Blood Sugar Management |
|---|---|---|---|---|
| Rapid-Acting | | | | |
| Lispro (Humalog)* | 15-30 min | 30-90 min | 3-5 h | Covers insulin needs for meals eaten at the same time as the injection. Often used with longer-acting insulin. |
| Aspart (Novolog) | 10-20 min | 40-50 min | 3-5 h | |
| Glulisine (Apidra) | 20-30 min | 30-90 min | 1-2.5 h | |
| Short-Acting | | | | |
| Regular* (Humulin R, Novolin R) | 30-60 min | 2-5 h | 5-8 h | Covers insulin needs for meals eaten within 30-60 min. |
| Insulin Pump (Velosulin) | 30-60 min | 1-2 h | 2-3 h | |
| Intermediate-Acting | | | | |
| NPH* (Humulin N, Novolin N) | 1-2 h | 4-12 h | 18-24 h | Covers insulin needs for half the day or overnight. Often combined with rapid- or short-acting insulin. |
| Long-Acting | | | | |
| Glargine* (Lantus) | 1-1.5 h | No peak | 20-24 h | Covers insulin needs for one full day. Often combined with rapid- or short-acting insulin. |
| Detemir (Levemir) | 1-2 h | 6-8 h | Up to 24 h | |
| Degludec (Tresiba) | 30-90 min | No peak | 42 h | |
| Pre-Mixed | | | | |
| NPH + Regular (Humulin 70/30) | 30 min | 2-4 h | 14-24 h | Taken two or three times a day before mealtime. Combines intermediate and short-acting insulin. Note: the factors in the first column indicate the percentage of each type of insulin (Intermediate/Short). |
| NPH + Regular (Novolin 70/30) | 30 min | 2-12 h | Up to 24 h | |
| Aspart Protamine + Aspart (Novolog 70/30) | 10-20 min | 1-4 h | Up to 24 h | |
| NPH + Regular (Humulin 50/50) | 30 min | 2-5 h | 18-24 h | |
| Lispro Protamine + Lispro (Humalog mix 75/25)* | 15 min | 30-150 min | 16-20 h | |

*Types of insulin present in the MIMIC III database.
2.2 Insulin Therapy in Intensive Care Units
2.2.1 Intensive Care Unit
Intensive care, also known as critical care, is a medical specialty dedicated to the comprehensive management
of patients having, or being at risk of developing, acute, life-threatening injuries and illnesses [33].
An intensive care unit (ICU) is an organized system for the provision of care to critically ill patients that
provides intensive and specialized medical and nursing care, an enhanced capacity for monitoring, and multiple
modalities of physiologic organ support to sustain life during a period of acute organ system insufficiency [33].
2.2.2 Dysglycemia
Dysglycemia is a term that refers to any disorder of blood sugar metabolism and can appear in two different
forms: hyperglycemia and hypoglycemia. As introduced in section 1.2, these episodes frequently occur in
critically ill hospitalized patients and are associated with adverse outcomes, including morbidity and mortality
[19, 20].
Hypoglycemia is a condition in which the amount of circulating glucose in the bloodstream is lower than
normal, while a higher amount than normal refers to hyperglycemia. Table 2.2 shows how blood glucose
levels are classified to describe each dysglycemic condition.
Hyperglycemia, besides being commonly associated with patients with a medical history of diabetes,
frequently appears as stress hyperglycemia [34, 35], which is an adaptive and appropriate response to a life-
threatening condition in previously normoglycemic patients [36].
Table 2.2: Blood glucose levels classification (mg/dL) [37, 38].

| Hypoglycemia Level 3 | Hypoglycemia Level 2 | Hypoglycemia Level 1 | Normal | Hyperglycemia |
|---|---|---|---|---|
| < 50 | [50-54[ | [54-70[ | [70-180] | > 180 |
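The thresholds of table 2.2 can be expressed as a small classification helper; `classify_glucose` is a hypothetical name and the function is only a sketch of the table's rules, not part of the thesis pipeline.

```python
def classify_glucose(mg_dl: float) -> str:
    """Classify a blood glucose level (mg/dL) using the thresholds of
    table 2.2: three hypoglycemia levels, normal, and hyperglycemia."""
    if mg_dl < 50:
        return "hypoglycemia level 3"
    if mg_dl < 54:
        return "hypoglycemia level 2"
    if mg_dl < 70:
        return "hypoglycemia level 1"
    if mg_dl <= 180:
        return "normal"
    return "hyperglycemia"

print(classify_glucose(65))   # hypoglycemia level 1
print(classify_glucose(120))  # normal
print(classify_glucose(250))  # hyperglycemia
```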
2.2.3 Intensive Insulin Therapy vs Conventional Insulin Therapy
The concept of controlling glycemia (normoglycemia) in ICU patients through insulin therapy to affect
outcomes has become increasingly complicated to apply and achieve. The debate between intensive insulin therapy
(IIT) and conventional insulin therapy (CIT) has been an object of study over the years.
The first large study was the Diabetes Mellitus Insulin Glucose Infusion in Acute Myocardial Infarction
(DIGAMI) study [39] in 1995 where 620 patients with diabetes mellitus and myocardial infarction were assigned
to receive conventional therapy or intensive therapy with a glucose-insulin infusion. In the trial, an improved
long-term prognosis was achieved in patients with intensive therapy.
In 2001, the Leuven surgical trial conducted by Van den Berghe [40] resulted in substantially reduced mortality
and morbidity in a surgical ICU with the use of IIT to maintain normal glucose levels.
Five years later, Van den Berghe conducted the Leuven medical trial [41] to assess the role of IIT, but for
patients in a medical ICU. IIT prevented morbidity but did not significantly reduce the risk of death among
all patients. However, those who stayed in the ICU for three or more days presented reduced morbidity and
mortality.
The largest study to date was conducted in 2009: the Normoglycemia in Intensive Care Evaluation-
Survival Using Glucose Algorithm Regulation (NICE-SUGAR) trial [42], with 6104 patients from medical and
surgical ICUs, contradicted the previous studies, because patients assigned to IIT had a higher mortality and
a higher incidence of hypoglycemic events compared with conventional control.
The Glucontrol study [43], published by Preiser in 2009, randomized medical and surgical ICU patients. IIT
did not reduce mortality, and the risk of hypoglycemia was increased. The study was prematurely interrupted,
which precluded definitive conclusions from being drawn.
One of the last published studies was [44], in 2014, with a different approach: instead of traditional IIT,
patients underwent tight computerized glucose control. Compared with conventional control, mortality did not
change significantly, but the protocol was associated with more frequent severe hypoglycemic episodes.
The last published studies [42-44] demonstrate that IIT is associated with a higher frequency of hypoglycemic
events, which is the major reason why CIT is nowadays the standard practice in the ICU.
A chronological summary of major randomized controlled trials is presented in table 2.3, with the number of
patients in each trial, the type of center where the trial was conducted, and whether the trial was single- or
multi-center. Lastly, the target blood glucose ranges for each type of insulin therapy in the trials are
identified.
Table 2.3: Chronological summary of major randomized controlled trials.

| Year | Clinical Trial | No. Patients | Center | Type of Center | IIT (mg/dL) | CIT (mg/dL) |
|---|---|---|---|---|---|---|
| 1995 | Malmberg - DIGAMI [39] | 620 | Multi | Acute Myocardial Infarction | 126-180 | NA |
| 2001 | Berghe - Leuven [40] | 1548 | Single | Surgical ICU | 80-110 | 180-200 |
| 2006 | Berghe - Leuven 2 [41] | 1200 | Single | Medical ICU | 80-110 | 180-200 |
| 2007 | Gandhi [45] | 620 | Single | Surgical ICU | 80-100 | <200 |
| 2008 | Arabi [46] | 523 | Single | Mixed ICU | 80-110 | 180-200 |
| 2008 | De la Rosa [47] | 504 | Single | Mixed ICU | 80-110 | 180-200 |
| 2008 | Brunkhorst [48] | 537 | Multi | Mixed ICU | 80-110 | 180-200 |
| 2008 | Mackenzie [49] | 240 | Multi | Mixed ICU | 72-108 | 198 |
| 2009 | NICE-SUGAR trial [42] | 6104 | Multi | Mixed ICU | 81-108 | 144-180 |
| 2009 | Preiser - Glucontrol [43] | 1078 | Multi | Mixed ICU | 79-110 | 140-180 |
| 2009 | Yang [50] | 240 | Multi | Neurological ICU | 80-110 | <200 |
| 2009 | Bilotta [51] | 483 | Single | Neurosurgical ICU | 80-110 | <215 |
| 2010 | Annane - COIITSS [52] | 509 | Multi | Septic Shock Patients | 80-110 | 180-200 |
| 2012 | Desai [53] | 189 | Single | Surgical ICU | 90-120 | 121-180 |
| 2013 | Giakoumidakis [54] | 212 | Single | Surgical ICU | 120-160 | 161-180 |
| 2014 | Macrae [55] | 1369 | Multi | Mixed pediatric ICU | 72-126 | 180-216 |
| 2014 | Kalfon [44] | 2648 | Multi | Mixed ICU | 79-109 | <180 |
2.3 Mortality Prediction in Intensive Care Units
2.3.1 Previous studies aiming mortality prediction in ICU
To date, no published studies have been conducted restricting the patient cohort to only patients under insulin therapy.
Since insulin is highly related to diabetic and hyper/hypoglycemic events, studies addressing these topics
were prioritized, but even these topic-related studies were scarce.
Three studies predicting mortality in patients within ICU stay were analyzed for comparison purposes.
These studies have in common the use of the same database as this thesis, MIMIC III [56].
In a recently published study [57], a cohort of 4111 diabetic patients with a mortality rate of 9.3% was used with
a random forest and a combined logistic regression of three severity scores (CCI, DCSI and Elixhauser)
to achieve AUC values of 78.7 and 78.5, respectively. Mean blood glucose was the variable most strongly
associated with mortality in diabetic patients. Among all variables, just five (diagnoses at
admission, type of admission, patient's glycated hemoglobin, age and mean glucose) suffice for a robust
classification of mortality risk.
A different approach was applied in [58], where only patients who stayed in the coronary care unit (CCU) were
analyzed, using only heart rate signals to predict mortality. Heart rate is a time-series feature and was described
in terms of 12 statistical and signal-based features from the first hour in the ICU. The cohort was composed of
2979 patients, and 8 different classifiers were employed. Random forest and decision tree classifiers had the
best results, with sensitivities of 97% and 92% and precisions of 97% and 90%, respectively.
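Describing a time series by statistical features, as done with heart rate in [58], can be sketched as follows. The exact 12 features used in that study are not reproduced here; `series_features` is a hypothetical helper computing a few common statistical summaries, and the heart-rate samples are invented for illustration.

```python
import numpy as np

def series_features(signal):
    """Summarize a time series (e.g., first-hour heart rate) with a few
    statistical features, in the spirit of [58] (whose exact set differs)."""
    s = np.asarray(signal, dtype=float)
    diffs = np.diff(s)  # sample-to-sample changes
    return {
        "mean": float(s.mean()),
        "std": float(s.std()),
        "min": float(s.min()),
        "max": float(s.max()),
        "range": float(s.max() - s.min()),
        "median": float(np.median(s)),
        "mean_abs_change": float(np.abs(diffs).mean()),
        # slope of a least-squares linear fit, as a trend feature
        "slope": float(np.polyfit(np.arange(len(s)), s, 1)[0]),
    }

hr = [82, 85, 90, 88, 87, 91, 95, 93]  # hypothetical heart-rate samples
feats = series_features(hr)
print(feats["mean"], feats["range"])  # 88.875 13.0
```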
In a real-time mortality prediction attempt, [59] extracted data from a random time during the ICU stay. Results
for the first 24 hours in the ICU were also recorded for comparison with common severity scores. The best performing
of the four models tested was gradient boosting. For the real-time experiment, gradient boosting had an AUC
of 92.0, and for the first-24-hours experiment the AUC was 92.7. It should also be noted that the developed model
outperformed the severity scores in predicting mortality. The final cohort had 50488 ICU stays from this
database.
Some nuances from the three studies described above were applied to define some of the main objectives
of this thesis:
• Interpretation of which features are associated with mortality/survival in patients under insulin therapy.
• Describe time-series features in terms of statistical features.
• Evaluation of diverse models to compare their performance.
• Extracting data from different times within ICU admission.
• Comparison of results with common severity scores.
2.3.2 Severity Scores
In section 2.3.1, severity scores were used by the researchers as a comparison term against the developed
models. These severity scores are used in the ICU to aid physicians in predicting patient outcome and
assessing trauma severity. There are several severity scoring systems, but five scores will be described and used
in this work.
LODS
The Logistic Organ Dysfunction System (LODS) is a way to assess organ dysfunction, using 12 variables to represent
the function of six organ systems (neurologic, cardiovascular, renal, pulmonary, hematologic, hepatic).
Variables are scored from 0 (no dysfunction) to 5 (maximum dysfunction) based on the worst value recorded
in the first 24 hours in the ICU.
SAPS
The Simplified Acute Physiology Score (SAPS) uses 14 physiological variables and their degree of deviation from
normal to assign a score based on the first 24 hours of the ICU stay.
SAPS II
The Simplified Acute Physiology Score 2 (SAPS II) is an upgrade to SAPS, assessing only 12 physiological variables
in the first 24 hours of ICU admission.
SOFA
The Sequential Organ Failure Assessment (SOFA) scores the worst value of each day in the ICU, on a range from 0
(low) to 4 (high), representing the malfunction of six organ systems (respiratory, cardiovascular, renal, hepatic,
central nervous and coagulation), for a total of 10 variables. Since the SOFA score changes over time, only first-
day values will be used in this study for comparison with the remaining severity scores.
QSOFA
Quick SOFA (qSOFA) identifies patients with suspected infection, since mortality increases among infected patients. It is
a simplified version of the SOFA score that takes into account just three variables: systolic blood pressure, respiratory rate
and the Glasgow Coma Scale. Each criterion contributes one point, so the total score ranges from 0 to 3, and it can be
easily measured by physicians, while SOFA requires more laboratory tests.
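The qSOFA computation can be sketched directly from its three criteria (systolic blood pressure ≤ 100 mmHg, respiratory rate ≥ 22/min, GCS < 15, following the Sepsis-3 definition); `qsofa` is a hypothetical helper name, not code from this work.

```python
def qsofa(systolic_bp: float, resp_rate: float, gcs: int) -> int:
    """Compute the qSOFA score: one point each for systolic blood
    pressure <= 100 mmHg, respiratory rate >= 22/min, and GCS < 15."""
    score = 0
    score += systolic_bp <= 100  # hypotension criterion
    score += resp_rate >= 22     # tachypnea criterion
    score += gcs < 15            # altered mentation criterion
    return int(score)

print(qsofa(95, 24, 14))   # 3: all three criteria met
print(qsofa(120, 16, 15))  # 0: no criteria met
```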
Chapter 3
Data pre-processing
This chapter explains how the final input dataset was obtained and how the cohort used in this work was
defined. Section 3.1 briefly describes the database used in this study. Section 3.2 starts with the patient
inclusion criteria, followed by a description of the input variables. Then, data preparation is conducted in sections
3.3 and 3.4, with outlier detection, handling of missing data, time-series variable discretization and variable
normalization. An explanation of the selected time windows and an analysis of the datasets coming from them are
conducted in section 3.5. Lastly, the data sampling techniques to be applied are presented in section 3.6.
3.1 MIMIC-III Database
The Medical Information Mart for Intensive Care (MIMIC) III [56] is a large clinical database containing health-
related data from patients admitted to the critical care units of the Beth Israel Deaconess Medical Center
between 2001 and 2012.
Developed by the MIT Laboratory for Computational Physiology, this database includes information such
as demographics, vital sign measurements made at the bedside (∼ 1 data point per hour), laboratory test
results, procedures, medications, nurse and physician notes, imaging reports, and out-of-hospital mortality.
Data are de-identified so as not to compromise the confidentiality and safety of the patients. For this purpose,
patients are identified by codes: subject id identifies a patient, hadm id refers to each hospital admission
and icustay id to each admission into the ICU. Dates are shifted into the future by a random offset, and the ages of
individuals over 89 years old were changed to values greater than 300 years.
The MIMIC III database is extracted from two different systems: CareVue and Metavision. Patients in the CareVue
system were admitted between 2001-2008, and admissions at a later date are recorded by Metavision. The two
systems archive data in different formats, which can lead to inconsistencies. The most prominent is
related to Item IDs, since several concepts are identified in multiple ways throughout the database.
3.2 Patients and Variables Selection
3.2.1 Inclusion Criteria
Initially, all patients were extracted to analyze the number of hospital admissions per patient and, for
each admission, the number of ICU stays. Readmissions were discarded, i.e., for patients with
multiple admissions or with more than one ICU stay per admission, only the first admission and first ICU stay were
included, to avoid biased assessments.
From this subset, adult (≥ 16 years old) patients who received insulin during the ICU stay were selected.
Infants were discarded because they have a different metabolism and, therefore, a different glucose control
protocol in the ICU. Lastly, only patients with a length of stay equal to or longer than 24 hours remained in the
study.
The number of patients extracted in each step is described in figure 3.1. The cohort prior to data treatment and
modeling is composed of 12338 patients.
Figure 3.1: Inclusion criteria applied to extract the cohort used in this work (flowchart: 61501 ICU stays in the database → 57328 first ICU stays during admission → 46428 first hospital admissions → 13195 patients that received insulin during the ICU stay and aged ≥ 16 (8044 from MetaVision, 5151 from CareVue) → length of stay ≥ 24 h → 12338 patients).
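The filtering steps of figure 3.1 can be sketched with pandas on a toy table. The column names (`subject_id`, `intime`, `received_insulin`, `los_hours`) and the rows are assumptions for illustration, not MIMIC-III's exact schema.

```python
import pandas as pd

# Hypothetical toy stand-in for MIMIC-III ICU stays (schema assumed).
stays = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 4],
    "icustay_id": [10, 11, 20, 30, 40],
    "intime": pd.to_datetime(["2130-01-01", "2130-06-01", "2131-03-02",
                              "2132-07-04", "2133-05-05"]),
    "age": [67, 67, 45, 14, 80],
    "received_insulin": [True, True, True, True, False],
    "los_hours": [48.0, 30.0, 20.0, 72.0, 96.0],
})

# 1) keep only each patient's first ICU stay (earliest intime)
first = stays.sort_values("intime").groupby("subject_id", as_index=False).first()
# 2) adults (>= 16) who received insulin during the stay
cohort = first[(first["age"] >= 16) & first["received_insulin"]]
# 3) length of stay of at least 24 hours
cohort = cohort[cohort["los_hours"] >= 24]
print(cohort["icustay_id"].tolist())  # [10]
```

Only subject 1's first stay survives all three filters in this toy example: subject 2's stay is too short, subject 3 is a minor, and subject 4 received no insulin.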
3.2.2 Input Variables
Patients’ information were divided into four major categories: demographic, diagnoses, laboratorial and vital
variables.
Demographic variables were extracted from Admission, Icustays and Patients tables.
Patients’ diagnoses with their respective ICD9 codes were extracted from Diagnoses icd table. ICD9 codes
meanings are described in D icd diagnoses table. Ventilation related data were extracted from Cpt events
and Chartevents tables. Inputevents mv and Inputevents cv tables served as support to extract insulin
infusions in patients during ICU stay.
Clinical measurements are present in both the Chartevents and Labevents tables, with duplicated values in
some instances. Labevents was chosen as the ground truth (as suggested in [56]) for laboratory variables,
with Chartevents providing the vital variables. The codes associated with each measurement are identified in
the D lab items and D items tables for Labevents and Chartevents, respectively.
Demographic variables
Demographic variables are attributes of the human population that constitutes the study and are presented in table
3.1, along with their types and abbreviated names.
As mentioned in section 3.1, the ages of patients older than 89 were changed to values greater than
300 years. Since the median age of those patients is 91.4, each such patient's age was rounded
to 91.
For the admission type in the ICU, emergency and urgent patients were merged into a single category, with
the elective category attributed to patients with a previously planned hospital admission.
Ethnicity was branched into five categories [Asian, Black, Hispanic, White and Other], based on the most
common categories present in the dataset. The last category [Other] covers all patients not assigned to
the first four categories.
Patient’s weight is recorded over-time though in an inconsistent way. Thus, only the first recorded value
is taken into account. Height is also extracted following the same method and body mass index (BMI) is
calculated through the formula: weightheight2
. Gender and length of stay are the remaining demographic variables
added to the list.
Table 3.1: Demographic variables.
# Demographics Type Abbreviation
1 Age Continuous age
2 Gender Categorical/Binary gender
3 Ethnicity Categorical ethnicity
4 Admission Type Categorical/Binary admission type
5 Length of Stay Continuous los icu
6 Weight Continuous weight first
7 Height Continuous height
8 BMI Continuous bmi
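The demographic preprocessing described above (resetting de-identified ages to the median of 91 and deriving BMI from weight and height) can be sketched as follows; the column names and rows are assumed for illustration, not MIMIC-III's exact schema.

```python
import pandas as pd

# Hypothetical demographic rows (schema assumed for illustration).
demo = pd.DataFrame({
    "age":          [54.0, 301.6, 310.2],  # de-identified ages appear as > 300
    "weight_first": [80.0, 65.0, 70.0],    # kg, first recorded value
    "height":       [1.80, 1.60, 1.75],    # m
})

# De-identified ages (patients > 89, shifted above 300) are set to the median 91
demo.loc[demo["age"] > 300, "age"] = 91
# BMI = weight / height^2
demo["bmi"] = demo["weight_first"] / demo["height"] ** 2
print(demo.round(1))
```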
Diagnoses variables
The International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes were
extracted as the diagnosis variables. ICD-9-CM is a standard set of alphanumeric codes used to describe
patients' symptoms and diagnoses. There are 6985 different ICD9 codes in the MIMIC III database, and taking
all the codes into account would be computationally unfeasible.
In order to reduce the number of variables, codes were grouped and arranged in a multi-level approach
following the rules of the Clinical Classifications Software (CCS) [25]. From level to level, the groups are divided
into multiple smaller and more descriptive groups, for a total of four levels. The first level consists of 18
groups, the second level of 136 groups and the third level of 367 groups. The fourth level serves as a more descriptive
level for just some groups of the third level and consists of 209 groups. Notice that there are a total of
15072 ICD9 codes, of which only 6985 are present in MIMIC III.
Table 3.2 shows an example of how codes are rearranged for the specific case of the patient with admission
number hadm id = 156814.
Table 3.2: Multi-level approach for patient diagnoses (hadm id = 156814).

| Level 1 | Level 2 | Level 3 | Level 4 | MIMIC III description | ICD9 code |
|---|---|---|---|---|---|
| Endocrine, nutritional and metabolic disorders | Diabetes mellitus with complications | Diabetes with ophthalmic manifestations | - | Diabetes with ophthalmic manifestations, type I [juvenile type], not stated as uncontrolled | 25051 |
| Diseases of nervous system and sense organs | Eye disorders | Retinal detachments, defects, vascular occlusion and retinopathy | Other retinal disorders | Background diabetic retinopathy | 36201 |
| Diseases of circulatory system | Hypertension | Essential hypertension | - | Unspecified essential hypertension | 4019 |
| Diseases of circulatory system | Cerebrovascular disease | Acute cerebrovascular disease | Intracranial hemorrhage | Intracerebral hemorrhage | 431 |
| Diseases of genitourinary system | Diseases of urinary system | Urinary tract infections | Urinary tract infections; site not specified | Urinary tract infections; site not specified | 5990 |
The first level was used to extract codes related to diseases in human body systems: the digestive system (1), circulatory system (2), respiratory system (3), nervous system and sense organs (4), musculoskeletal system and connective tissue (5), genitourinary system (6) and skin and subcutaneous tissue (integumentary system) (7). Diseases of the blood and blood-forming organs (8), mental illness (9) and infectious and parasitic diseases (10) were also extracted. Each diagnosis numbered [1-10] above constitutes a different variable that sums the number of ICD-9 codes associated to each patient for that specific diagnosis.
Codes associated to diabetes were identified from the second level, since in the first level they are assigned to endocrine, nutritional and metabolic disorders. Codes were split into diabetes mellitus type I, type II and secondary diabetes (see table 3.2). For the identified diabetic patients, it was also verified whether or not they were long-term insulin users.
Some patients needed ventilation during their stay in the ICU. The total duration in hours and the number of different times a patient was under external ventilation were calculated. Lastly, the Glasgow coma scale and the number of insulin infusions (noting that this number can be related to changes in the insulin infusion rate) were also extracted. Table 3.3 details the diagnoses variables included in this work.
Table 3.3: Covariates associated to patients' diagnoses.

Diagnoses | Type | Abbreviation
Digestive system (1) | Continuous | digestive sys
Circulatory system (2) | Continuous | circulatory sys
Respiratory system (3) | Continuous | respiratory sys
Nervous system and sense organs (4) | Continuous | nervous sys
Musculoskeletal system and connective system (5) | Continuous | musculoskeletal sys
Genitourinary system (6) | Continuous | genitourinaty sys
Integumentary system (7) | Continuous | skin sys
Diseases of the blood and blood-forming organs (8) | Continuous | blodd sys
Mental illness (9) | Continuous | mental sys
Infectious and parasitic diseases (10) | Continuous | infetous sys
Glasgow coma scale | Discrete | gcs
Number of times ventilated | Discrete | ventilation num
Time under ventilation | Discrete | ventilation time
Diabetes Type I | Binary | diabetes typeI
Diabetes Type II | Binary | diabetes typeII
Secondary diabetes | Binary | diabetes sec
Long-Term Insulin User | Binary | user insulin
Number of insulin infusions | Discrete | num infusions
Regarding laboratory and vital sign variables, measurements of the same variable are collected over time. Data is stored under several codes with the same or a similar description (homonyms), meaning the same measurement may have different syntax (e.g. systolic blood pressure has 6 different codes associated).
Laboratory variables
For each variable, the percentage of patients with at least one measurement during the first 24 hours of the ICU stay was extracted. Due to the importance of glucose for the study, it was set as the baseline variable to include or exclude variables. In figure 3.2 it is clear that glucose is the most common variable (X axis). In an initial approach to verify a trade-off between the number of patients and the number of variables, this histogram was divided into three subsets representing the percentage {70%, 80%, 90%} of the information available for each variable (horizontal dotted lines). Then, the percentage of patients (Y axis) that could be included in each subset was counted. Subjects with at least one missing variable are removed and not included in the subset.
There is a noticeable loss of patients between 90% and 80%: at 80% the loss represents almost 40% of the total number of patients. For this reason, the laboratory variables chosen for the study were those in the 90% subset.
Given the absence of a universal range of normal values for each variable, a standard interval was deduced in a conservative manner through research in specialized entities [60-62]. The criterion was to keep the minimum values found in the literature for the lower limits and the maximum values for the upper limits. Table 3.4 summarizes all laboratory variables and the normal range associated to each one.
Figure 3.2: On the left, a histogram with the percentage of patients per laboratory variable and the chosen subsets represented. On the right, a table with the total number of patients (NP) and the number of variables associated (NV) for each subset:

% | NP | NV
90 | 11876 | 17
80 | 8196 | 22
70 | 7983 | 26
Table 3.4: Laboratory variables and normal ranges [60-62].

# | Variable (units) | Range | Abbreviation
1 | Anion gap (mEq/L) | [7-20] | aniongap
2 | Bicarbonate (mEq/L) | [23-28] | bicarbonate
3 | Chloride (mEq/L) | [96-108] | chloride
4 | Creatinine (mg/dL) | [0.4-1.3] | creatinine
5 | Hemoglobin (g/dL) | [12-18] | hemoglobin
6 | Hematocrit (%) | [37-52] | hematocrit
7 | MCH - Mean corpuscular hemoglobin (pg) | [28-32] | mch
8 | MCHC - Mean corpuscular hemoglobin concentration (%) | [33-36] | mchc
9 | MCV - Mean corpuscular volume (fL) | [80-98] | mcv
10 | Platelet count (K/uL) | [150-450] | platelet
11 | Potassium (mEq/L) | [3.5-5.1] | potassium
12 | RBC - Red blood cells (m/uL) | [4.2-5.6] | rbc
13 | RDW - Red cell distribution width (%) | [9-14.5] | rdw
14 | Sodium (mEq/L) | [134-145] | sodium
15 | Urea nitrogen (mg/dL) | [6-25] | bun
16 | WBC - White blood cells (K/uL) | [5-10] | wbc
17 | Glucose (mg/dL) | [70-110] | glucose
Vital variables
Vital variables collects patients’ vital signs during their ICU stay. Heart rate, respiratory rate, both systolic and
diastolic blood pressure, mean arterial pressure, peripheral oxygen saturation, temperature and urine output
are the variables were included. In table 3.5 is summarized the vital variables and their respective normal
ranges.
Table 3.5: Vital variables and normal ranges [60-62].

# | Variable (units) | Range | Abbreviation
1 | Heart rate (bpm) | [60-100] | heartrate
2 | Respiratory rate (breaths per min) | [12-16] | resprate
3 | Systolic blood pressure (mmHg) | [90-120] | sysbp
4 | Diastolic blood pressure (mmHg) | [60-80] | diasp
5 | Mean arterial pressure (mmHg) | [70-110] | meanbp
6 | Peripheral oxygen saturation (%) | [95-100] | spO2
7 | Temperature (°C) | [36-37] | tempc
8 | Urine output (mL/h) | [30] | urineoutput
3.3 Data Preparation
Real-world datasets are generally incomplete, noisy and inconsistent. Medical datasets, especially ICU-related ones, represent a bigger challenge than conventional datasets because they exhibit unique features: due to the patients' critical condition, their recorded values might be abnormal [63]. Nevertheless, measurements outside the normal range can also come from systematic errors caused by equipment malfunction or random errors due to human mistakes when recording the data. Distinct measurements in a medical dataset must therefore be divided into abnormal but probable values, and outliers whose values are physiologically impossible or highly improbable. It is thus important to filter the information to differentiate between them.
On the other hand, missing data occurs when no value is stored or recorded for a variable during an observation and/or result recording. Handling missing data is essential to avoid biased conclusions that may lead to invalid results.
3.3.1 Removal of Outliers
Common practices to remove outliers and identify novelties in medical datasets are based on statistical methods such as the interquartile range (IQR) method and Tukey's method, which uses the IQR approach in a conservative way; the z-score, a standard-deviation approach; or the local outlier factor (LOF), based on local density deviation. Despite the efficiency of these methods, clinical knowledge in conjunction with a careful inspection of each variable is more time-consuming, but a better outcome is expected in return. This approach was applied to the selected variables.
Figure 3.3 shows an example of outlier detection for the variable aniongap. The measurements associated to each patient are plotted in a graph to give a perception of the underlying distribution of the variable, and a box plot is used to graphically represent groups of the variable's data. Values inside the normal range are delimited by a red dashed line in order to check for patterns in the data. Through careful inspection, an inclusion boundary is set (green dashed lines) and values outside this boundary, considered outliers, are removed.
Outlier detection for the remaining variables is presented in Appendix A.
Figure 3.3: Outlier detection for the anion gap variable
3.3.2 Missing data
Relatively few absent observations on distinct variables can shrink the dataset sample size on a large scale. Missing data is either missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR) [64, 65].
If missing data is randomly distributed across all observations, it is considered MCAR. It usually appears due to equipment malfunction, samples lost in transit or technically unsatisfactory samples [64].
MAR is a more realistic assumption than MCAR because the observed data is no longer a random sample: in addition to following a pattern, missing data is correlated with a set of observed variables. MAR may appear in laboratory variables for more severely injured patients, when clinical staff are busier providing time-sensitive care and skip results annotation [66].
MNAR cases are commonly complicated, and the causes of missingness are difficult to determine. They are normally associated with study dropout and patients' illness or refusal to contribute. Missing data in demographic variables can be placed in this group.
The handling of missing data is attenuated by a well-planned and careful data extraction for the study. If necessary, conventional or statistical methods can be coupled to treat missing data.
Listwise or complete-case deletion is the most used and simplest method: if a specific case has missing data for any variable, or a variable has several missing cases, an exclusion criterion is applied. This method was applied in section 3.2.2 in an initial approach to select laboratory and vital variables and will be applied in section 3.5.1 to achieve the final input dataset for the study.
Imputation methods replace missing values with a reasonable guess. Mean substitution, regression imputation and last observation carried forward (LOCF) are practical procedures, but not useful for laboratory and vital variables since a time-series discretization will be applied (section 3.4.1).
An intuitive zero-imputation was implemented for the diagnoses variables' missing data, even though the lack of records does not necessarily mean that a patient was not under a given condition. For example, the scarcity in ventilation-related data covers about 900 patients, and an intentional zero-imputation implies that those patients were not under any type of ventilation.
The first 24 hours after admission is the time-window that will be used to obtain the final variable set. For this analysis, the graph presented in figure 3.4 shows the 7 features (X axis) with the highest number of missing samples (left Y axis), ordered from left to right. The green line shows the amount of missing data per variable. The blue line represents the total percentage of remaining patients if the corresponding variable on the X axis were removed. If BMI, height and temperature were excluded, there would be a gain in the number of patients included; sliding further to the right presents a less significant gain in the percentage of patients. For that reason, these three variables were excluded from this point on.
Figure 3.4: Percentage of patients lost and its relationship with missing data in some features.
3.3.3 Normalization
Data from different variables come in different orders of magnitude. All input data was normalized to mitigate
the potential bias of one variable with large numeric values dominating other variables having smaller values.
Min-max normalization (equation 3.1) was applied to rescale the values (x) of each feature (i) to the interval [0, 1]. Normalization is not necessary for categorical and binary variables.

x_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}    (3.1)
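As a minimal sketch of equation 3.1 in plain Python (the feature values below are invented for illustration):

```python
# Min-max normalization (equation 3.1): each continuous feature is rescaled to
# [0, 1]. A minimal sketch with invented glucose values.

def min_max_normalize(values):
    """Rescale a list of numeric values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                     # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

glucose = [70, 110, 250, 90]         # mg/dL, hypothetical measurements
print(min_max_normalize(glucose))    # smallest value maps to 0.0, largest to 1.0
```

Note that min and max are taken over the training data of each feature, so a constant feature must be special-cased to avoid division by zero.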
3.4 Feature Construction
Feature construction is the process of augmenting the space of variables by inferring or creating additional variables [67]. Feature construction methods may be applied to improve prediction performance and allow easy addition of domain knowledge. Thus, it is important to generate a set of variables that are generalizable to different classifiers [68].
3.4.1 Discretization of Time-series Variables
Laboratory and vital variables are recorded over time at a pace of approximately one measurement per hour.
Quantitative features were calculated to represent those variables, so each variable is described in terms of six statistical features: maximum, minimum, mean, median, standard deviation and variance. Moreover, one additional feature counting the number of abnormal measurements, i.e., values recorded outside the normal ranges (see table 3.4), was included for each variable, except for the glucose variable, where more features detailing the abnormal measurements were constructed due to its relevance for the study.
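A sketch of this discretization step, using the anion-gap normal range [7, 20] from table 3.4 and invented hourly measurements (population standard deviation and variance are used here; the thesis does not state which variant was applied):

```python
# Time-series discretization sketch: each variable is reduced to the six
# statistical features plus a flag counting out-of-range measurements.
import statistics

def summarize(series, normal_range):
    lo, hi = normal_range
    return {
        "max": max(series),
        "min": min(series),
        "mean": statistics.mean(series),
        "median": statistics.median(series),
        "std": statistics.pstdev(series),
        "var": statistics.pvariance(series),
        "flag": sum(1 for v in series if not lo <= v <= hi),
    }

aniongap = [12, 15, 22, 9, 25]            # hypothetical hourly values (mEq/L)
features = summarize(aniongap, (7, 20))   # [7, 20] is the range from table 3.4
print(features["flag"])                   # 2 measurements fall outside [7, 20]
```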
3.4.2 Glycemic Covariates
In addition to the statistical features, more covariates were worked out from the glucose information. These are represented in four features: a count of hypoglycemic events for each hypoglycemia level and a count of hyperglycemic events, according to the values presented in table 2.2.
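The event counting can be sketched as below. The exact cut-offs live in table 2.2 (not reproduced here); the thresholds used are the common ADA levels and are assumptions made only for this illustration:

```python
# Counting glycemic events from the glucose series. Thresholds are assumed
# (common ADA levels), not the ones from table 2.2.
HYPO_LEVELS = {1: 70, 2: 54, 3: 40}  # mg/dL, hypoglycemia levels 1-3 (assumed)
HYPER_LIMIT = 180                    # mg/dL, hyperglycemia cut-off (assumed)

def glycemic_counts(glucose):
    counts = {f"hypo{lvl}": sum(1 for g in glucose if g < cut)
              for lvl, cut in HYPO_LEVELS.items()}
    counts["hyper"] = sum(1 for g in glucose if g > HYPER_LIMIT)
    return counts

print(glycemic_counts([65, 50, 120, 200, 38]))
# {'hypo1': 3, 'hypo2': 2, 'hypo3': 1, 'hyper': 1}
```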
3.4.3 Categorical Variables
Categorical variables are split into multiple variables using one-hot encoding. One-hot encoding is used instead of label encoding to avoid models assuming that variables have some kind of hierarchy when they clearly do not (e.g. nominal features).
According to the n different labels present in each categorical variable, n binary variables are constructed. This is done for ethnicity (a nominal variable), resulting in 5 binary variables. In the case n = 2 (binary variables), as for gender and admission type, only one of the constructed binary variables prevails, avoiding redundancy and repetition.
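A minimal pure-Python illustration of this encoding (label values below are hypothetical):

```python
# One-hot encoding sketch: each of the n labels of a nominal variable becomes
# a binary column; for binary variables (n = 2) only one column is kept to
# avoid redundancy.

def one_hot(values, drop_first=False):
    labels = sorted(set(values))
    if drop_first:
        labels = labels[1:]          # keep n - 1 columns for binary variables
    return [{lab: int(v == lab) for lab in labels} for v in values]

ethnicity = ["White", "Asian", "Black", "White"]
print(one_hot(ethnicity)[0])         # {'Asian': 0, 'Black': 0, 'White': 1}

gender = ["M", "F", "M"]
print(one_hot(gender, drop_first=True))   # [{'M': 1}, {'M': 0}, {'M': 1}]
```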
3.5 Selected Time-windows
Sensitivity analyses in this work were performed independently in four different time-windows: predicting mortality with data from the first 12 hours after ICU admission, the first 24 hours after ICU admission, the 12 hours before ICU discharge and the 24 hours before ICU discharge. The cohort size for each time-window is highly dependent on the num infusions variable, because only patients already under insulin therapy during a specified time-window are considered; those who received insulin outside the specified time-window are neglected.
The study will mainly focus on the first 24 hours after admission in order to compare the results with common severity scores, which as a rule are calculated in the same time-window (section 2.3.2). The variable length of stay in the ICU (los icu) was excluded, since its values would be the same for all patients in the same time-window.
Some data associated to a specific patient may appear, and remain the same, in two or more time-windows, because different time-windows may overlap. For example, for a patient who stayed in the ICU for only 24 hours, the value of any variable in the first 12 hours after admission and in the 12 hours before discharge will be exactly the same.
Figure 3.5 shows how time-stamps can switch between each other depending on the duration of stay after ICU admission (green solid line). Time-stamps after admission are static (blue solid lines), but that does not hold for discharge times (black lines). For stays longer than 48 hours (red solid line), the time-stamps before discharge (black continuous lines) appear after the 12- and 24-hour time-stamps after admission; for stays shorter than 48 hours, they may appear before the time-stamps after admission (dashed black lines).
In any case, this switching between time-stamps has little influence since each trial will be carried out separately.
Figure 3.5: Prediction time-windows switching for the study.
3.5.1 Description of Processed Datasets
The constitution of each time-window to be analyzed is presented in table 3.6. As expected, the number of available patients rises as the ICU discharge time-stamp gets closer, and the mortality ratio also increases. Furthermore, in all time-windows there is a notorious imbalance between mortality and survival.

Table 3.6: Number of patients included in each time-window.

Time-window | Patients under insulin therapy | Patients after missing data removal | Died/Survived (mortality ratio)
12h after admission | 7626 | 7100 | 541/6559 (0.076)
24h after admission | 9643 | 9098 | 826/8272 (0.091)
24h before discharge | 11932 | 9593 | 1270/8323 (0.132)
12h before discharge | 11956 | 11435 | 1353/10082 (0.118)
All ICU stay | 12338 | 11788 | 1377/10041 (0.114)
To conclude this section, table 3.7 summarizes the 188 input variables used in this thesis work.
3.6 Data Sampling
Class imbalance is a recurrent problem found in real-world datasets, where the instances predominantly belong to one class. Among others, it is an extremely common problem in medical datasets, especially in mortality prediction [24, 57, 58]. The final datasets detailed in section 3.5.1 have this characteristic as well.
There are two approaches to deal with class imbalance: cost functions and sampling techniques. The sampling approach can be divided into three categories: oversampling, undersampling and hybrid, which is a mix of over- and undersampling. For this study, the first two categories of sampling techniques will be used to adjust the dataset class distribution.
3.6.1 Over Sampling
Oversampling focuses on the minority class to overcome the imbalance problem. Some of the most used techniques are described below.
• Random Over Sampling. Random over sampling simply picks samples at random, with replacement, from the minority class.
• Synthetic Minority Over-sampling Technique (SMOTE). Proposed in [69], SMOTE synthesizes new minority instances. In general, SMOTE takes one real minority sample and its k closest minority-class neighbours. At each iteration, one of those k neighbours is chosen and a new minority sample is synthesized between the minority sample and that neighbour. This process is repeated until a balance between the minority and majority classes is achieved.
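The SMOTE interpolation step can be sketched as follows. This is a simplified pure-Python illustration (Euclidean distance, one synthetic point at a time), not the library implementation used in the experiments:

```python
# Minimal SMOTE-style synthesis: a new minority sample is interpolated between
# a real minority sample and one of its k nearest minority neighbours.
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote_sample(minority, k=2, rng=None):
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    neighbours = sorted((p for p in minority if p is not base),
                        key=lambda p: euclidean(base, p))[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()               # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, nb))

minority = [(1.0, 1.0), (1.2, 0.9), (2.0, 2.1)]
print(smote_sample(minority))        # lies on a segment between two real samples
```

Repeating this until the minority class matches the majority class in size yields the balanced training set described above.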
3.6.2 Undersampling
Undersampling focuses on the majority class to overcome the imbalance problem. Some of the most used methods are:
• Random Under Sampling. Random under sampling randomly picks and eliminates samples from the majority class, adjusting the dataset class distribution.
• Tomek Links. The method [70] starts by calculating the distance between the samples in a dataset. If two samples from different classes are each other's nearest neighbours, then the pair is considered a Tomek link. Samples belonging to Tomek links are usually located on the boundary between classes. From each Tomek link, the sample from the majority class is removed. This process is repeated until all nearest neighbours belong to the same class.
• Edited Nearest Neighbours. Edited nearest neighbours (ENN) [71] removes samples from the majority class according to the k nearest neighbours: a majority-class sample is removed if one of its k nearest neighbours does not belong to the same class.
• Neighborhood Cleaning Rule. The neighborhood cleaning rule (NCR) modifies the ENN method by adding data cleaning [72], so it is a more conservative method. First, it identifies noisy data using the ENN rule. Then, the minority class is analyzed and the k nearest samples are found; neighbours that belong to the majority class are removed.
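The Tomek-link criterion, the core of the second method above, can be sketched in a few lines (toy 1-D points, invented labels; not the library implementation used in this work):

```python
# Identifying Tomek links: two samples from different classes that are each
# other's nearest neighbour. The majority-class member of each link would be
# removed during undersampling.

def nearest(idx, points):
    return min((j for j in range(len(points)) if j != idx),
               key=lambda j: abs(points[idx] - points[j]))

def tomek_links(points, labels):
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if labels[i] != labels[j] and nearest(j, points) == i:
            links.append(tuple(sorted((i, j))))
    return sorted(set(links))

points = [0.0, 0.4, 0.5, 3.0]          # samples on a line
labels = [0, 0, 1, 1]                  # 0 = majority, 1 = minority
print(tomek_links(points, labels))     # [(1, 2)]: the boundary pair across classes
```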
Table 3.7: Description of the input variables.

Feature | Units or Categories | Type
Age | Years | Continuous
Gender | [Male, Female] | Binary
Ethnicity | [Asian, Black, Hispanic/Latino, White and Other] | Categorical
Admission Type | [Elective, Emergency] | Binary
Length of Stay | Fractional days | Continuous
Ventilation-Time | Hours | Continuous
Ventilation-Number | (Frequency/ICU stay) | Discrete
Infusions-Number | (Frequency/ICU stay) | Discrete
Diabetes | [Type I, Type II, Secondary] | Categorical
Long-Term Insulin User | [True, False] | Binary

Feature parameter | Feature (units) | Type
Minimum, Maximum, Mean, Median, Standard deviation, Variance, Flag, HyperFlag*, HypoFlag{1,2,3}* | Laboratory analyses: anion gap (mEq/L), bicarbonate (mEq/L), chloride (mEq/L), creatinine (mg/dL), hemoglobin (g/dL), hematocrit (%), mean corpuscular hemoglobin (pg), mean corpuscular hemoglobin concentration (%), mean corpuscular volume (fL), platelets (K/uL), potassium (mEq/L), red blood cells (m/uL), red cell distribution width (%), sodium (mEq/L), urea nitrogen (mg/dL), white blood cells (K/uL), glucose (mg/dL)*. Vital signs: heart rate (bpm), respiratory rate (breaths per min), systolic blood pressure (mmHg), diastolic blood pressure (mmHg), mean arterial pressure (mmHg), peripheral oxygen saturation (%) | Continuous
Minimum, Last | Glasgow coma scale | Continuous
First | Weight (kg) | Continuous
Sum | Urine output (mL) | Continuous
- | Diagnoses, problems in: digestive system (1), circulatory system (2), respiratory system (3), nervous system and sense organs (4), musculoskeletal system and connective system (5), genitourinary system (6), integumentary system (7); diseases of the blood and blood-forming organs (8); mental illness (9); infectious and parasitic diseases (10) | Continuous

* Exclusive to categorize the feature Glucose.
* HypoFlag: hypoglycemia at any of its severity levels (Table 2.2).
* HyperFlag: hyperglycemia event.
Chapter 4
Modeling
This chapter starts with a schematic of the modeling process, identifying all the techniques used along with the respective libraries, in section 4.1. Then, an assessment of the most appropriate modeling techniques is made in section 4.2. The following sections give a theoretical explanation of the machine learning techniques that will be applied for modeling. Subsequently, feature selection methods are presented in section 4.5. In section 4.6, model interpretation with SHAP values is detailed. The performance metrics used in this work are explained in section 4.7.
4.1 Knowledge Discovery Process
Model construction was developed in Python 3.6 with the use of several libraries widely used in data science [73-81]. Each step and the respective library associated is presented in figure 4.1. The processor used to perform all the tests was an Intel® Core 8th generation i7-8750H Hexa-Core, 2.20 GHz with turbo up to 4.10 GHz and 9 MB cache.
Figure 4.1 is divided into two schematics: one for the construction process of the model and another for its external validation.
For the construction process, after data preparation (chapter 3) and achieving a final input dataset, a 5x10 cross-validation is performed with diverse machine learning techniques. Data sampling and feature selection are optional steps, implemented to counteract the imbalance present in the dataset and to reduce the final feature subset, respectively. Models are interpreted, and feature rankings (weights or SHAP values, depending on the machine learning technique used) aid the feature selection process in the case of recursive feature elimination (RFE). From the models' output, performance metrics are calculated and serve as support for feature selection in the case of sequential forward selection (SFS); in the case of recursive feature selection (RFS), both rankings and performances serve as support. Threshold-based metrics use as threshold the one minimizing the difference between the true and false positive rates, assigning values above it to class 1 (dead) and values below it to class 0 (alive). Lastly, individualized clinical dashboards are constructed using the models' outputs and interpretation. For the external validation, a final model is constructed using all the patients extracted from the MIMIC database and the features resulting from the feature selection process. The model is validated using the patients extracted from the eICU database. As in the construction process, the model's outputs are interpreted and individualized clinical dashboards are constructed.
[Figure 4.1 flowchart: Data Preparation [73-75] → input dataset (MIMIC database) → repeated 5x10 K-fold cross-validation [75] → optional data sampling (over- or undersampling [76]) and feature selection (RFE [75], SFS [81], RFS) → model (machine learning techniques [75, 77-79]) → model interpretation (SHAP values [82], weights [75]) → model output and threshold choice → performance metrics (AUC, AUPRC, sensitivity, specificity [75]) → individualized clinical dashboards [80]. External validation: a final model trained on the MIMIC dataset is validated on the eICU dataset, with performance metrics, model interpretation and individualized clinical dashboards.]
Figure 4.1: Model construction layout. *Optional steps
4.2 Modeling Techniques Assessment
It is important to assess the advantages and limitations of each modeling technique that could be applied to a particular problem. There is a range of machine learning techniques that can be used separately or together to predict an outcome.
The attributes taken into account when selecting the right algorithm were predictive accuracy, computational speed, interpretability, simplicity, robustness and scalability [83]. Each one can be classified on the scale [Low, Medium, High], as represented in figure 4.2.
As the objective of this work is to predict mortality in critically ill patients under insulin therapy for glycemia control, predictive accuracy and interpretability play an important role, since the main purpose is to create a model capable of predicting patient mortality and, at the same time, exhibiting a complete picture of the health status of each patient. On the other hand, scalability owes its importance to the fact that the study is conducted in time-windows of different sizes (section 3.5), bearing in mind that, in a long-term perspective, more patients can be included in the study.
Figure 4.2: Importance of the trade-offs to take into account when choosing a machine learning algorithm (adapted from [83])
In conjunction, a set of machine learning algorithms was tested to corroborate the characteristics of the problem and select the most suitable algorithms to carry on with the study.
Table 4.1 describes the machine learning algorithms assessed for this work in terms of interpretability [84] and simplicity [85], considering a dataset with high dimensionality. The remaining attributes (predictive accuracy, speed, robustness and scalability) were deduced after testing the algorithms in the different prediction time-windows of this study (section 5.2). These modeling techniques are described in the sections below.
Table 4.1: Machine learning algorithms assessed based on [84, 85].
Method Abbreviation Interpretability Simplicity
K-Nearest Neighbours (n=3) KNN Low High
Support Vector Machines SVM Low Low
Decision Trees DT High High
Random Forest RF Medium Medium
Logistic Regression LR High High
AdaBoost ADA Medium Medium
Gradient Boosting GB Medium Medium
Gaussian Naive Bayes GNB High High
Linear Discriminant Analysis LDA High High
Quadratic Discriminant Analysis QDA High High
4.2.1 Logistic Regression
Logistic regression (LR) is a widely used machine learning technique for binary classification problems. Founded on a statistical background in 1958 by Cox [86], LR owes its name to the logistic function at the technique's core. The logistic function is an S-shaped curve that takes any real-valued number and maps it between 0 and 1, never reaching the limits.
LR describes the relation between an output value and the input values through a linear combination of weights and coefficient values.
p(x) = \frac{e^{\beta_0 + \sum_{i=1}^{N} \beta_i x_i}}{1 + e^{\beta_0 + \sum_{i=1}^{N} \beta_i x_i}}    (4.1)

where β_0 represents the bias or intercept term and β_i the regression coefficient (weight) for each input variable.
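Equation 4.1 can be evaluated directly in a few lines of Python (the coefficients below are invented for illustration):

```python
# The logistic model of equation 4.1: a linear combination of the inputs pushed
# through the S-shaped logistic function, yielding a probability in (0, 1).
import math

def logistic_probability(x, beta0, betas):
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))   # algebraically equal to e^z / (1 + e^z)

p = logistic_probability([0.5, 1.2], beta0=-1.0, betas=[0.8, 0.3])
print(round(p, 3))                      # 0.44
```

The equivalent form 1/(1 + e^{-z}) is used because it avoids overflow for large positive z.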
4.2.2 Linear and Quadratic Discriminant Analysis
Linear discriminant analysis (LDA) is a simple and mathematically robust technique for classification. LDA makes the following assumptions about the dataset in order to estimate the mean and variance for each class: each variable is modeled as a multivariate Gaussian distribution with the density of equation 4.2, and the Gaussians of all classes are assumed to share the same covariance matrix, i.e. Σ_k = Σ for all k.
Quadratic discriminant analysis (QDA) is equivalent to LDA, with the difference that a covariance matrix is estimated for each class.
P(X \mid y = k) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(X - \mu_k)^t \Sigma_k^{-1} (X - \mu_k)\right)    (4.2)
In equation 4.2, k represents each class, µ_k is the class mean and d is the number of features. Both methods use Bayes' theorem to estimate the probability of the data belonging to each class (equation 4.3):

P(y = k \mid X) = \frac{P(X \mid y = k)\, P(y = k)}{P(X)}    (4.3)
4.2.3 Support Vector Machines
Support vector machines (SVM) form a classifier defined by the construction of a hyperplane, or a set of hyperplanes, in a multidimensional space separating the different classes. The hyperplane chosen is the one with the maximum distance between the data points of both classes, also called the maximum margin.
Data points closer to the hyperplane are the support vectors, which influence the position and orientation of the hyperplane. These maximize the classifier's margin so that test points can be classified with more accuracy.
SVM can create non-linear regions to separate classes more efficiently by using kernel functions. Among the kernel functions available, the radial basis function (equation 4.4) is the one used in this work.
k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)    (4.4)

where x_i and x_j are feature vectors and ||x_i − x_j||² is the squared Euclidean distance between them. σ defines how much influence a single training example has.
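Equation 4.4 is straightforward to compute; a sketch with toy vectors:

```python
# The radial basis function kernel of equation 4.4: similarity decays with the
# squared Euclidean distance between two feature vectors.
import math

def rbf_kernel(xi, xj, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))   # identical points give 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))   # distant points give a value near 0
```

A larger σ widens the region of influence of each training example, giving smoother decision boundaries.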
4.2.4 K-Nearest Neighbours
K-nearest neighbours (KNN) is one of the simplest classifiers, working as a majority vote of the nearest neighbours of each data point. The principle is to find a predefined number k of training samples closest in distance to the test sample and predict the label from these by majority voting.
The metric used to compute the proximity between samples is the Minkowski distance, presented in equation 4.5.
\left(\sum_{i=1}^{k} |x_i - y_i|^q\right)^{1/q}    (4.5)

Here, x represents the training samples and y the test samples. q can take the values 1 or 2, representing the Manhattan and Euclidean distances respectively. In this work, q = 2 and k = 3.
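The whole procedure fits in a few lines; a sketch with invented training points and labels:

```python
# k-nearest-neighbours as described above: Minkowski distance (q = 2, i.e.
# Euclidean) and majority voting among the k = 3 closest training samples.
from collections import Counter

def minkowski(x, y, q=2):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def knn_predict(train, labels, sample, k=3):
    ranked = sorted(range(len(train)), key=lambda i: minkowski(train[i], sample))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.0), (0.1, 0.2)]
labels = ["alive", "alive", "dead", "alive"]
print(knn_predict(train, labels, (0.1, 0.1)))   # alive (3-0 vote)
```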
4.2.5 Gaussian Naive Bayes
Gaussian naive Bayes (GNB) is a probabilistic classifier based on Bayes' theorem, with the naive assumption of conditional independence between every pair of features. The likelihood of the features is assumed to be Gaussian, with the probability calculated as described in equation 4.6.
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)    (4.6)

where the parameters σ_y² and µ_y are estimated using maximum likelihood.
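Equation 4.6 evaluated directly (the per-class parameter values below are invented for illustration):

```python
# The Gaussian class-conditional likelihood of equation 4.6: each feature is
# modelled by a normal density with a per-class mean and variance estimated
# by maximum likelihood.
import math

def gaussian_likelihood(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu_y, var_y = 100.0, 225.0       # hypothetical per-class glucose mean/variance
print(gaussian_likelihood(110.0, mu_y, var_y))   # a density, not a probability
```

The density peaks at the class mean and decays symmetrically; GNB multiplies these per-feature likelihoods together under the independence assumption.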
4.3 Ensemble learning
Ensemble learning is considered the machine learning analogue of the wisdom of the crowd. Considering learning algorithms as individuals, the collective knowledge from different and independent individuals typically exceeds the knowledge of any single individual.
Diverse approaches to the concept were made over time, commencing in 1785 with the Condorcet jury theorem [87], which proves that, under the plurality voting rule, a jury of partially informed voters is more likely to reach the correct decision than any single voter alone. In the early 1900s, Galton [88] carried out an experiment at a fair, encouraging 787 uneducated farmers to guess the weight of an ox, and obtained an error of less than 1% between the median of the guesses and the true weight.
The four major concepts required to form a wise crowd capable of improving on individual knowledge [89]
are:
• Diversity of opinion - People in the crowd should have a range of experiences, education and opinions.
• Independence - Each person’s opinion is not affected or influenced by others.
• Decentralization - People have specializations and can make conclusions based on local information.
• Aggregation - Mechanism for turning all predictions into a collective decision.
Applying the concept to machine learning, ensemble learning can be described as a combination of several
base estimators (”weak learners”) in order to produce one optimal predictive model (”strong learner”).
Ensemble methods can be classified into two main groups: parallel/independent methods and sequen-
tial/dependent methods. In parallel methods, multiple estimators are built independently and predictions
combined using model averaging techniques (e.g. bagging [90], random forests [91]). In sequential methods,
estimators are built sequentially and one tries to reduce the errors of the combined estimator (e.g. boosting
[92], gradient boosting [93]).
4.3.1 Parallel Methods
Bagging
Bagging stands for bootstrap aggregating and was proposed by Breiman [90] in 1996. It is a method for
generating multiple versions of a base estimator using bootstrap replicates of the training set, obtained by
sampling with replacement. Each replicate works as a new training set containing the same number of instances
as the original dataset, ensuring a sufficient number of instances per estimator. The aggregation step averages
the outputs (for regression) or takes a majority vote (for classification) to obtain an optimized estimator.
Although usually used with decision trees, the method can be used with any type of estimator.
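The procedure above can be sketched with scikit-learn's BaggingClassifier; the synthetic dataset and the choice of 50 trees are illustrative only.

```python
# Minimal sketch of bagging: bootstrap replicates of the same size as the
# original set (sampling with replacement), combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # any base estimator works
    n_estimators=50,
    max_samples=1.0,           # replicate size = original size
    bootstrap=True,            # sampling with replacement
    random_state=0,
)
bag.fit(X, y)
print(round(bag.score(X, y), 2))
```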
Random Forest
Random forests were first introduced in 1995 in [94] but proved effective in [91] as an extension of bagging.
Besides replicating samples with replacement, only a random subsample of features is used in the training
process of each estimator. This avoids excessive similarity among estimators and high correlation in the model's
predictions. In this work, a total of 100 decision trees were used as estimators.
4.3.2 Sequential Methods
The idea of boosting is to convert a weak learner into a strong learner by sequentially training weak learners,
each trying to improve on its predecessor's predictions.
Boosting has its beginnings in the study of Valiant [95], called the Probably Approximately Correct (PAC)
model, which became the basis for Kearns' work [96]. The conception that a weak learner performing only
slightly better than random guessing can be 'boosted' into a strong learner is credited to Kearns, Valiant
and Schapire [95–97].
AdaBoost
AdaBoost, short for adaptive boosting, is the most well-known boosting algorithm and the first to achieve
real success, being the root of several notable variations [98].
The key idea behind AdaBoost is to reweight the training data, increasing the weights of misclassified
instances while decreasing the weights of correctly classified instances. Initially, all instances have the same
weight. New weak learners are then added iteratively, each focusing on the instances with the highest weights.
Each weak learner receives an individual weight according to its overall predictive performance, and predictions
are made by calculating the weighted average of the weak classifiers.
Statistical boosting and Gradient Boosting Machines
Many machine learning approaches, including AdaBoost, can be considered black boxes: they might
yield accurate predictions, but the structure of the relationship between input data and output is not clearly
interpretable. In contrast, statistical models aim to describe and explain relationships in a structured way,
quantifying the importance of the variables as well as their effect on the interpretation.
AdaBoost and related algorithms were recast in a statistical framework by Breiman [99], showing that
boosting can be understood as a functional gradient descent algorithm. Afterwards, Friedman [93], building on
Breiman's initial approach, elaborated this statistical point of view into gradient boosting machines
(GBM).
GBM builds a stage-wise additive model in which each weak learner is added depending on the previous weak
learners, performing gradient descent in a functional space [93]. Algorithm 1 presents an overview of the
technique. For a more detailed mathematical formulation, consult Appendix B.
GBM using decision trees as weak learners, commonly named gradient boosting decision trees (GBDT),
became widely used due to the simplicity and advantages of these learners, namely requiring less effort in data
preparation, dealing with nonlinear relationships between parameters, and handling outliers as well as missing values.
Algorithm 1 Gradient boosting algorithm
Inputs:
• Training data (X_i, y_i), i = 1, ..., N
• Number of iterations M
• Loss-function choice ψ(y, F)
• Weak learner h(X, θ)
Algorithm:
1: Initialize f_0
2: for t = 1 to M do
3:   Compute the negative gradient g_t(x)
4:   Fit a new weak learner h(X, θ_t)
5:   Find the best gradient descent step-size ρ_t:
     ρ_t = arg min_ρ Σ_{i=1}^{N} ψ[y_i, f_{t−1}(X_i) + ρ h(X_i, θ_t)]
6:   Update the function estimate: f_t ← f_{t−1} + ρ_t h(X, θ_t)
7: end for
In recent years, a new GBDT framework has gained recognition. This framework, proposed by Chen [77],
is called XGBoost (XGB), and it has demonstrated superior performance compared to random forests [91].
More recently, Microsoft presented a novel approach called LightGBM (LGB) [78], offering faster training
speed, higher efficiency and better accuracy. CatBoost (CB) [79] was then presented by Yandex, achieving
significant improvements over XGB and LGB on benchmark datasets.
Figure 4.3 presents a timeline of the ensemble learning techniques described so far.
Figure 4.3: Timeline of ensemble learning.
4.4 Gradient Boosting Frameworks
Gradient boosting models are based on decision tree models. These split a dataset into small subsets while
incrementally developing a structured decision-making process. In analogy with real-world trees, decision trees
grow upside down: the dataset is split into branches through decision nodes, partitioning the data on the
feature with the largest information gain.
The top decision node is called the root node, and nodes that do not lead to further decisions are called
leaves. Branches connect the root to the leaves through internal decision nodes.
Since this thesis focuses on these techniques, it is necessary to explain in more detail how the GBM models
used in this work (i.e. CB [79], XGB [77], and LGB [78]) are tuned.
4.4.1 Tree Structure
There are two different strategies for growing decision trees: level-wise and leaf-wise.
In the level-wise strategy, each node splits the data prioritizing the nodes closer to the root, maintaining a
balanced tree, whereas in the leaf-wise strategy the tree grows by splitting the node with the highest loss
change, being more prone to overfitting.
LGB uses the leaf-wise strategy, XGB grows level-wise by default (a leaf-wise mode is also available), while
CB uses the level-wise strategy with the particularity of using oblivious trees, in which only one feature can be
selected for splitting at a given level.
4.4.2 Best split method
Finding the best split for each node is a key challenge in training a GBDT. Decision tree models split each node
on the feature with the largest information gain, measured by the variance reduction after splitting. In large
datasets it is computationally expensive to go through every data point of every feature to find the best split,
so approximation methods are required to decrease training time.
The histogram-based method groups each feature's data points into discrete bins and uses these bins to
find the best split value.
The pre-sorted splitting method sorts the data points by feature value in order to calculate gradient statistics
and propose candidate split points. The information gain is then calculated for each candidate split point along
each feature, and among all features the best split is taken for a node.
The Gradient-based One-Side Sampling (GOSS) method ranks the training data points in descending order
of the absolute values of their gradients. It preserves the top a × 100% data points with larger gradients and
randomly samples b × 100% of the remaining data points with smaller gradients. The sampled small-gradient
data is amplified by a constant (1 − a)/b when calculating the information gain. The split point is thus
calculated over a smaller subset, reducing the computational cost.
GOSS is used exclusively by LGB, while XGB and CB use the pre-sorted algorithm. Both XGB and LGB
also offer the histogram-based method.
4.4.3 Loss Function
In agreement with algorithm 1, determining the loss function (also known as the objective parameter) is required
to fit a new weak learner.
Logarithmic loss, typically referred to as logloss or cross-entropy loss, is a probability-based classification
metric: a probability of belonging to a class is assigned to each sample, rather than simply yielding the most
likely class.
Logloss per sample is the negative log-likelihood of the classifier (equation 4.7):

L_log(y, p) = −log P(y|p) = −(y log(p) + (1 − y) log(1 − p)) (4.7)

Logloss is used in XGB, LGB, and CB for binary classification.
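Averaged over samples, equation 4.7 can be sketched directly; the labels and probabilities below are illustrative.

```python
# Minimal sketch of equation 4.7: per-sample logloss, averaged over samples.
import numpy as np

def logloss(y_true, p):
    p = np.clip(p, 1e-15, 1 - 1e-15)   # guard against log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3])
print(round(logloss(y, p), 4))  # -> 0.1976
```

Note that confident wrong predictions are penalized heavily, which is why the probabilities are clipped away from 0 and 1.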
4.4.4 Hyperparameter tuning
Hyperparameters are the variables which determine the structure of an algorithm and establish how it
is trained. Their importance lies in controlling the behaviour and performance of the training algorithm.
Hyperparameters differ from model parameters in that they cannot be directly learned during training (e.g. in
algorithm 1, a model parameter is optimized by evaluating the gradient of a loss function during training).
The process of tuning hyperparameters, also called hyperparameter optimization, is based on two questions,
according to [100]:
• Which of the algorithm's hyperparameters matter most for empirical performance?
• Which values of these hyperparameters are likely to yield high performance?
Following the contributions of [101] and the developers' suggestions [77–79], the most important
hyperparameters and the respective ranges of values used are described in table 4.2.
Table 4.2: Hyperparameters to tune.

Function            | Parameter                 | XGBoost          | LightGBM         | CatBoost       | Range
Control overfitting | Learning rate             | learning_rate    | learning_rate    | learning_rate  | [0.01; 0.1]
                    | Maximum tree depth        | max_depth        | max_depth        | depth          | [1; 10]
                    | Tree's number of leaves   | min_child_weight | num_leaves       | -              | [1; 50]
                    | Iterations / no. of trees | n_estimators     | n_estimators     | iterations     | [50; 500]
Control speed       | Feature subsample         | colsample_bytree | feature_fraction | rsm            | [0.1; 0.9]
                    | Bagging                   | subsample        | bagging_fraction | bootstrap_type | [0.1; 0.9]
4.5 Feature selection
Feature selection aims to remove irrelevant, redundant or noisy features to obtain a subset of relevant
features, achieving higher accuracy, lower computational cost and better model interpretability. Feature selection
methods can be categorized into four groups: filter, wrapper, hybrid and embedded.
Filter methods use statistical analyses, such as chi-square or mutual information, to assign scores
to features, removing those with a score lower than a specified threshold. Despite being fast and
independent of a classifier, they are potentially naive at minimizing redundancy between features.
In contrast to filter methods, wrapper methods use a classifier and a performance measure to score a
subset of features. Many search procedures are possible, which makes them slower and computationally costly
when dealing with high-dimensional datasets. However, they usually produce better results than filter
methods.
Hybrid methods combine filter and wrapper methods sequentially. At first, a subset is selected through a
filter method and then a wrapper selection is performed to search for the best subset.
Embedded methods were proposed to bridge the gap between filter and wrapper methods. In contrast to
those methods, embedded methods do not separate the classifier learning from the feature selection.
4.5.1 Recursive Feature Elimination
Recursive feature elimination (RFE) was introduced in [102] as an instance of backward feature elimination
for gene selection in cancer classification. The main principle is to recursively remove features based on their
importance, producing smaller feature subsets until a desired number of features is reached or in accordance
with a performance measure. RFE is described in algorithm 2.
In this work, RFE was performed by removing at each step the least important feature, i.e., the feature
with the lowest absolute SHAP value or weight. The absolute value is taken into account because features can
either have a positive impact on the final outcome, favouring the patient's mortality, or a negative impact,
favouring the patient's chance of survival. This process is carried out over a 5x10-fold cross validation (see
section 4.7) and, at the end, the mean absolute impact is calculated for each feature. Features are then ranked
by importance in ascending order, and cross validation is performed again, eliminating the features one-by-one
according to their ranking. The mean AUC and mean AUPRC are recorded when eliminating each feature.
Algorithm 2 Recursive Feature Elimination
Inputs:
• Stop criterion, M
• Ranking criterion, ω
• Classifier, h
Algorithm:
1: while M do
2:   Train classifier, h
3:   Compute ranking criterion, ω, for all features
4:   Remove the feature with the smallest ranking criterion
5: end while
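Algorithm 2 can be sketched as follows, using the absolute weights of a logistic regression as the ranking criterion in place of the SHAP-based, cross-validated ranking of this work; the synthetic data and the stop criterion of three features are illustrative.

```python
# Minimal sketch of Algorithm 2 (RFE): repeatedly drop the feature with the
# smallest ranking criterion (here, absolute logistic-regression weight).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
features = list(range(X.shape[1]))

while len(features) > 3:                      # stop criterion M
    clf = LogisticRegression(max_iter=1000).fit(X[:, features], y)
    ranking = np.abs(clf.coef_[0])            # ranking criterion omega
    features.pop(int(np.argmin(ranking)))     # remove least important feature

print(sorted(features))  # indices of the 3 surviving features
```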
4.5.2 Sequential Forward Selection
Sequential forward selection (SFS) is a wrapper method that uses a greedy search approach to select the best
feature subset. Succinctly, each feature present in the dataset is evaluated individually and, based on a specific
metric, the one that returns the best performance is selected. From then on, in an iterative process, each of
the remaining features is evaluated jointly with the features already selected, and the subset presenting the
best performance is kept. This process is repeated for all the features in the dataset or until a desired number
of features is reached. SFS is described in algorithm 3.
In this work, SFS was performed up to a subset of 30 features, since a higher number of features is not
desired and searching over the full feature set would be infeasible in terms of computational cost due to the
relatively high number of features in the dataset. AUC was the performance metric used to select the feature
at each iteration.
Algorithm 3 Sequential Feature Selection
Inputs:
• Stop criterion, M
• Performance metric, ω
• Classifier, h
Algorithm:
1: Create empty set of features Y_k = ∅
2: while M do
3:   for each feature (y) not in Y_k do
4:     Train classifier, h, on subset Y_k + y
5:     Compute performance metric, ω
6:   end for
7:   Y_k = Y_k + y with max ω(Y_k + y)
8: end while
4.5.3 Recursive Feature Selection
This work proposes a novel approach entitled recursive feature selection (RFS). RFS is a hybrid approach
between RFE and sequential methods. Following the same principle as RFE of eliminating the least important
features at each iteration, in this method the k least important features are eliminated individually, one at a
time, and the feature subset that returns the best performance, according to a specific metric, is chosen. This
process is repeated iteratively until a desired number of features is reached or along the whole feature set. RFS
is described in algorithm 4.
In this work, RFS was performed by testing at each step the 5 least important features, i.e., the features
with the lowest absolute SHAP value or weight, depending on the method used. This process is carried out
over a 5x10-fold cross validation, both to evaluate the features' importance and to evaluate the performance
when removing each feature identified as least important. The feature associated with the best performance
is then the one selected for removal. AUC is the metric used to evaluate performance during the feature
selection process.
Algorithm 4 Recursive Feature Selection
Inputs:
• Stop criterion, M
• Ranking criterion, ω1
• Performance metric, ω2
• Classifier, h
Algorithm:
1: while M do
2:   Train classifier, h
3:   Compute ranking criterion, ω1, for all features
4:   for each of the k lowest ranked features do
5:     Eliminate the feature from the subset
6:     Compute performance metric, ω2
7:   end for
8:   Select the subset that achieves the highest performance
9: end while
4.6 Model Interpretation
As referred to in section 4.3.2, black-box algorithms can be recast in statistical frameworks to achieve model
interpretability. However, the high dimensionality of datasets increases model complexity, reducing interpretability
in order to achieve better results. This creates a trade-off between accuracy and interpretability which
must be managed.
4.6.1 Features Ranking
Knowing the importance of each feature in the final model's prediction is significant knowledge that
can be used to rank features and inform a feature selection process. Feature importances are calculated
either in an individualized manner, for a single prediction, or over an entire dataset, to describe the model's
global behaviour.
Linear models, such as LR and LDA, in a simple approach assign weights to each feature reflecting its
importance to the model. Other techniques used are p-values or bootstrap scores.
Tree-based ensemble methods, such as GBDT, use three different methods to estimate the importance of
features in a dataset: gain, split count and permutation.
The simplest approach, split count, counts how many times a feature is used to split a tree's node, while gain
sums all information gains acquired by the splits on a given feature. Distinctly, permutation observes the
change in the model's error when randomly permuting the values of a feature in a test set.
Lundberg et al. [82] proved that, with the exception of permutation methods, these feature importance
attribution methods are inconsistent, i.e. a given feature's importance can decrease due to model changes even
when the model in fact depends more on that feature. To overcome this inconsistency, a unification approach
was proposed through SHapley Additive exPlanation (SHAP) values.
4.6.2 SHapley Additive exPlanation values
SHAP values were founded on ideas from game theory and local explanations. Developed on the premise
of viewing any explanation of a model's prediction as a model itself [82], this method unified six other existing
methods into a class named additive feature attribution methods. Combining game theory with the unified
class, SHAP values provide a unified measure of feature importance better aligned with human intuition.
Initially focused on linear models (such as LR and LDA), kernel methods (such as SVM) and deep learning
algorithms, a seventh method was later added to compute SHAP values for trees and tree-based ensemble
methods [103], such as GBDT.
SHAP values for each feature in a linear model are estimated using equation 4.8, while for tree-based
models they are computed by estimating E[f(x)|x_S] using equation 4.9, where f_x(S) = E[f(x)|x_S]. At a
high level, equations 4.8 and 4.9 express the difference between the model's prediction with and without
feature i (equation 4.10).

φ_i = β_i (x_i − E[x_i]) (4.8)

φ_i = Σ_{S ⊆ N\{i}} [ |S|! (M − |S| − 1)! / M! ] [f_x(S ∪ {i}) − f_x(S)] (4.9)

Importance of feature i = f_x(with feature i) − f_x(without feature i) (4.10)

For further reading on a more detailed mathematical description, see Appendix D and the authors' explanations
[82, 103].
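Equation 4.8 can be verified with a small sketch: for a linear model, the SHAP values of a sample sum exactly to the difference between the model's prediction and the mean prediction. The coefficients and data below are illustrative.

```python
# Minimal sketch of equation 4.8 for a linear model:
# phi_i = beta_i * (x_i - E[x_i]), so the phi_i of a sample sum to
# f(x) - E[f(x)] in the model's raw output space.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta, b = np.array([2.0, -1.0, 0.5]), 0.3   # illustrative coefficients
f = X @ beta + b                            # linear model's raw output

x = X[0]
phi = beta * (x - X.mean(axis=0))           # equation 4.8, one phi_i per feature
print(np.allclose(phi.sum(), (x @ beta + b) - f.mean()))  # -> True
```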
4.7 Model Performance Assessment
4.7.1 Repeated K-Fold Cross Validation
In k-fold cross validation, the original dataset is partitioned into k equally sized subsamples, also called folds.
Of those, k − 1 folds are used for training and the remaining fold is used for testing. This process is
repeated k times so that every fold serves for both training and testing.
The k-fold cross validation is then repeated n times with a different randomization of the samples that
constitute the folds, and the results obtained over all n × k folds are averaged.
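The 5x10-fold scheme (n = 5 repetitions of k = 10 folds) can be sketched with scikit-learn; the synthetic data and the logistic regression are placeholders for the models evaluated in this work.

```python
# Minimal sketch of the 5x10-fold cross validation used in this work:
# 10 folds, repeated 5 times with different shuffles, results averaged.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

aucs = []
for train_idx, test_idx in cv.split(X, y):     # 5 x 10 = 50 folds in total
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    p = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], p))

print(len(aucs), round(float(np.mean(aucs)), 3))  # 50 folds, mean AUC
```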
Performance metrics aid in evaluating and comparing different models. The study objective is a binary
classification problem where a patient either dies (1) or survives (0), and the goal is to predict that outcome.
Classification models predict the likelihood of a patient dying as the probability of the patient belonging to a
given class (die or survive). A patient can be assigned to a class by choosing the class with the highest probability
or by defining a threshold that allocates probabilities above or below it into classes.
4.7.2 Sensitivity, Specificity and Precision
The confusion matrix is a table that describes the performance of a model by summarizing prediction outcomes.
In a binary classification problem, four outcomes are possible:
Figure 4.4: Confusion Matrix (source [104])
• True positive (TP): positive cases correctly classified
• True negative (TN): negative cases correctly classified
• False positive (FP): negative cases incorrectly classified as positive
• False negative (FN): positive cases incorrectly classified as negative
From these four outcomes, the performance metrics used to assess the models can be estimated. These
metrics are described in the next subsections.
Sensitivity or Recall
Sensitivity is the fraction of positive cases correctly classified among the total number of positive cases.
It represents the number of deceased patients correctly classified among those who actually died.

Sensitivity = TP / (TP + FN) (4.11)

Specificity
Specificity is the fraction of negative cases correctly classified among the total number of negative cases.
It represents the number of surviving patients correctly classified among those who actually survived.

Specificity = TN / (TN + FP) (4.12)

Precision
Precision is the fraction of positive predictions correctly classified among the total number of cases classified
as positive. It represents the number of deceased patients correctly classified among those correctly classified
as deceased plus those incorrectly classified as deceased.

Precision = TP / (TP + FP) (4.13)
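The three metrics of equations 4.11-4.13 can be computed directly from the four outcome counts; the toy labels and predictions below are illustrative.

```python
# Minimal sketch: sensitivity, specificity and precision from confusion-
# matrix counts (equations 4.11-4.13).
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 3 positives, 5 negatives
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # 2
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # 1
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # 4
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # 1

sensitivity = tp / (tp + fn)   # 2/3
specificity = tn / (tn + fp)   # 4/5
precision = tp / (tp + fp)     # 2/3

print(round(sensitivity, 2), round(specificity, 2), round(precision, 2))
```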
Specificity, sensitivity and precision are threshold-based metrics, and a threshold choice in favour of one of
these metrics can decrease the performance of the remaining ones. When dealing with imbalanced datasets,
these metrics also tend to favour the majority class, making a coherent assessment of the model difficult.
Thus, for this study, threshold-free metrics are used to obtain a generalized view of the model's classification
performance. The area under the receiver operating characteristic curve (AUC) and the area under the
precision-recall curve (AUPRC) are the metrics chosen.
The threshold selected is based on the ROC curve, minimizing the distance between the false positive rate
(1 − specificity) and the sensitivity.
4.7.3 Area under the receiver operating characteristic curve
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false
positive rate (1 − specificity) at different thresholds. The AUC measures the entire two-dimensional area
underneath the ROC curve and characterizes how well a model is capable of distinguishing between classes.
An AUC value of 0.5 corresponds to a random classifier, while a value of 1 corresponds to a perfect classifier.
However, the AUC might give an overoptimistic picture of the model in the presence of class imbalance,
which may lead to a false interpretation of model performance [105].
4.7.4 Area under the Precision-Recall curve
The precision-recall curve (PRC) shows the trade-off between precision and recall (sensitivity) at different
thresholds, and the AUPRC measures the area underneath it. A random classifier achieves an AUPRC equal to
the ratio Positives / (Positives + Negatives) between positive (1) and negative (0) cases. AUPRC values must
therefore be judged relative to the imbalance of the dataset: an AUPRC of 0.5 on a balanced dataset is a poor
performance, representing a random classifier, but on a highly imbalanced dataset it is considered a good
performance.
AUPRC is preferred for imbalanced datasets because it is more informative than AUC when evaluating
binary classifiers [105].
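Both threshold-free metrics can be sketched on a toy imbalanced problem (about 10% positives), which shows how the AUPRC baseline is the prevalence rather than 0.5; scikit-learn's average_precision_score is used here as a common summary of the area under the precision-recall curve, and the scores are synthetic.

```python
# Minimal sketch: AUC and an area-under-PR-curve summary on an imbalanced
# toy problem. Higher scores are noisily associated with the positive class.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.uniform(size=1000) < 0.1).astype(int)   # ~10% positives
scores = y + rng.normal(0, 0.6, size=1000)       # noisy but informative

# AUC has a 0.5 random baseline; the PR-curve baseline is the prevalence
print(round(roc_auc_score(y, scores), 3),
      round(average_precision_score(y, scores), 3))
```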
Chapter 5
Results
This chapter presents the main results and justifies the decisions taken to choose a final model. Section
5.1 briefly describes the cohort containing all patients under insulin therapy, giving a general overview of the
data distribution. Sections 5.2 and 5.3 focus on choosing the best-performing machine learning algorithm
with which to proceed; a performance analysis was made between first- and last-day data. Section 5.4 presents
a detailed analysis using the first-day data in order to fine-tune the hyperparameters and to examine how
sampling techniques affect the models' performance; feature selection is also implemented to reduce the final
number of features. In section 5.5, the results obtained are compared with common severity scores used in
ICUs to predict patient mortality. Further, results for different time windows are presented in section 5.6.
Lastly, the best model's results are compared with the results of the similar studies previously presented in
section 2.3.
5.1 Descriptive Analysis of the Cohort
Figure 5.1 shows a dashboard with a general overview of the data from all patients (12,338) under insulin
therapy along their entire ICU stay.
This dataset is composed predominantly of white and male patients. Patients' ages are mostly between
60 and 80 years, and there is also a significant number of patients above 89 years old, indicating an aged
cohort.
Despite being under insulin therapy, more than half of these patients (56.8%) are non-diabetic. Among
diabetic patients, the majority are type II diabetics, representing 36.3% of the whole dataset.
As previously analysed in section 3.5.1, the imbalance present in the dataset is significant (11.4%). However,
analysing patients' length of stay in the ICU by days, a noticeable decrease in imbalance accompanies an
increasing length of stay. This indicates that a higher risk of mortality is associated with a longer stay in the ICU.
The most common length of stay in the ICU is between 1 and 2 days, but stays longer than 10 days also
represent a significant number of patients.
Figure 5.1: General overview of the working dataset (All ICU stay, n = 12 338)
5.2 Selection of a Machine Learning Technique
Several machine learning algorithms were tested with a 5x10-fold cross validation, using independently as
input the covariates gathered during the first and the last day a patient is admitted to the ICU.
Analysing figures 5.2 and 5.3 along with table 5.1, five algorithms stand out in predictive performance and
deserve further analysis, namely GB, LR, LDA, ADA and SVM. Of those, GB and LR were chosen to proceed
with the study, for the reasons given below.
The GB model, despite being computationally costly, as seen in figures 5.2c and 5.3c where GB is the
second most time-consuming model (275.73 s and 329.07 s), has the highest predictive performance in both the
first-day (AUC 90.97 ± 1.23, AUPRC 54.16 ± 4.86) and last-day (AUC 92.53 ± 1.12, AUPRC 73.50 ± 3.09) analyses.
LR and LDA present quite similar performances. LR gives a slightly better predictive performance (AUC
89.85 ± 1.55 and 91.55 ± 1.03, AUPRC 50.69 ± 5.36 and 69.48 ± 3.20), while LDA (6.65 s and 6.86 s) outperforms
LR (16.14 s and 20.66 s) in computational cost. Since predictive performance plays the more important role, LR
prevails for further study, and also because it is the baseline model in health data analysis.
SVM and ADA are discarded because they achieved worse results than the chosen algorithms in terms of
computational cost (SVM: 981.91 s and 1560.42 s) and predictive performance (ADA: AUC 88.98 ± 1.65 and
90.94 ± 1.31, AUPRC 47.01 ± 5.52 and 68.47 ± 3.00).
Figure 5.2: Performance analysis where different machine learning algorithms are compared (First-day)
Figure 5.3: Performance analysis where different machine learning algorithms are compared (Last-day)
Table 5.1: Performance metrics for machine learning techniques
Algorithm AUC [%] AUPRC [%] Sensitivity [%] Specificity [%] Time [s]
First-day
KNN 71.74± 2.52 27.05± 3.58 52.47± 4.95 89.92± 0.90 115.51
SVM 88.28± 1.84 51.84± 5.09 80.80± 2.22 80.90± 2.33 981.91
DT 65.35± 2.64 19.14± 2.77 37.56± 5.28 93.14± 0.96 65.51
RF 83.53± 2.22 37.25± 4.71 75.03± 4.69 80.31± 3.46 24.44
LR 89.85± 1.55 50.69± 5.36 82.35± 1.96 82.18± 1.93 16.14
ADA 88.98± 1.65 47.01± 5.52 81.38± 2.36 81.41± 2.34 166.23
GB 90.97± 1.23 54.16± 4.86 83.19± 2.07 83.13± 2.03 275.73
GNB 81.58± 2.14 26.59± 2.93 73.47± 2.14 73.54± 2.20 2.22
LDA 88.45± 1.94 50.21± 5.09 80.99± 2.00 81.11± 1.95 6.65
QDA 78.17± 3.02 28.79± 4.26 73.58± 3.09 73.42± 3.20 4.77
Last-day
KNN 73.30± 2.09 37.08± 3.50 57.43± 4.01 86.92± 1.00 138.36
SVM 90.27± 1.40 68.12± 3.69 82.28± 1.78 82.29± 1.88 1560.42
DT 70.79± 2.16 30.89± 2.87 49.59± 4.21 91.99± 0.90 70.88
RF 86.76± 1.94 55.44± 3.81 81.76± 4.18 76.86± 4.47 27.05
LR 91.55± 1.03 69.48± 3.20 83.98± 1.52 83.88± 1.59 20.66
ADA 90.94± 1.31 68.47± 3.00 82.93± 1.55 82.92± 1.67 201.80
GB 92.53± 1.12 73.50± 3.09 84.69± 1.76 84.83± 1.73 329.07
GNB 81.47± 1.58 35.04± 2.72 74.28± 1.58 74.22± 1.53 2.36
LDA 90.83± 1.28 68.21± 3.46 83.42± 1.42 83.34± 1.51 6.86
QDA 81.51± 2.51 36.28± 3.14 75.00± 1.93 75.10± 1.87 4.81
5.3 Selection of a Gradient Boosting Framework
For GB, three different frameworks (i.e. XGB, LGB and CB) were tested and compared with the
original algorithm (GB).
A set of default parameters was assessed in order to evaluate the performance of each GB algorithm
and help select a model to be extensively evaluated for hyperparameter tuning. For this step, subsampling
and feature sampling were discarded to avoid biased comparisons between the algorithms with regard to
computational and overall performance. The parameters chosen were learning_rate = 0.1, max_depth = 5
and iterations = 100, along with a 5x10-fold cross validation.
Analysing figures 5.4 and 5.5 along with table 5.2, the overall performance is quite similar across all
algorithms. XGB emerges as the best-performing algorithm in both AUC (91.36 ± 1.28 and 93.22 ± 1.06) and
AUPRC (56.57 ± 4.66 and 75.04 ± 2.94) for the first- and last-day analyses, followed by LGB and CB. GB has
the lowest performance (considering all metrics) among all algorithms.
Nonetheless, LGB completely stands out in terms of computational time (5.30 s and 11.40 s), running in
less than 20% of the time of the second fastest algorithm (CB) and less than 5% of the time of XGB, which
achieved the highest predictive performance. LGB was for that reason selected over the original algorithm (GB).
Figure 5.4: Performance analysis comparing GB algorithms (First-day)
Figure 5.5: Performance analysis comparing GB algorithms (Last-day)
Table 5.2: Performance metrics for each GB algorithm
Algorithm AUC [%] AUPRC [%] Sensitivity [%] Specificity [%] Time [s]
First-day
GB 90.97± 1.23 54.16± 4.86 83.19± 2.07 83.13± 2.03 275.73
XGBoost 91.36± 1.28 56.57± 4.66 83.70± 1.76 83.65± 1.89 154.12
LightGBM 91.06± 1.32 56.05± 4.70 83.17± 1.89 83.35± 2.03 5.30
CatBoost 91.19± 1.35 55.41± 5.06 83.63± 2.03 83.78± 1.85 42.58
Last-day
GB 92.53± 1.12 73.50± 3.09 84.69± 1.76 84.83± 1.73 329.07
XGBoost 93.22± 1.06 75.04± 2.94 85.80± 1.71 85.81± 1.75 172.26
LightGBM 93.09± 1.07 74.94± 2.81 85.57± 1.59 85.51± 1.70 11.40
CatBoost 92.57± 1.14 74.28± 3.06 84.85± 1.71 84.80± 1.83 74.84
5.4 First-day Analysis
5.4.1 Hyperparameter tuning
First-day data will be the baseline dataset to tune the hyperparameters that will be used in the other time-
windows. LR has a simpler approach than LGB, so hyperparameter tuning is not required.
For LGB, hyperparameter tuning was divided in two steps in order to control model tendency to over-fitting
and check how fast the model converges.
First step was to check the trade-off between the number of estimators and learning rate. The values
chosen for learning rate were [0.01, 0.03, 0.05, 0.07, 0.1] while the number of estimators varied between
[25− 1500]. Those ranges represent a more exploratory process than those proposed in section 4.4.4.
In figure 5.6 is shown that high learning rates lead to a faster convergence of the model but it is visible,
through the decrease of performance metrics, that the model tends to over-fit as the number of estimators
increase. A lower learning rate require more estimators to converge, but increase the computational cost.
Hence, a learning rate = 0.03 and No.Estimators = 200 were chosen for modeling since these values
demonstrated that, at the same time, it avoids over-fitting and preserve a relative fast convergence.
Figure 5.6: Number of Estimators vs Learning Rate
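This first tuning step is, in essence, a grid search over two coupled hyperparameters. A minimal sketch with scikit-learn is shown below; it uses GradientBoostingClassifier, whose `learning_rate` and `n_estimators` parameters share the names used by LightGBM's sklearn API, and a deliberately reduced grid rather than the full ranges described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=10, weights=[0.85, 0.15],
                           random_state=0)

# Reduced grid; the thesis sweeps learning_rate in [0.01, 0.1]
# and n_estimators in [25, 1500]
grid = {"learning_rate": [0.03, 0.1], "n_estimators": [25, 100]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid,
                      scoring="roc_auc", cv=3)
search.fit(X, y)
best = search.best_params_  # the trade-off pair with the highest CV AUC
```

Inspecting `search.cv_results_` for every pair, instead of just the best one, yields the convergence curves of figure 5.6.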
The second step, with the learning rate and number of estimators already fixed, was to analyze how tree depth and number of leaves influence the model's performance. The values chosen for max depth were [3, 4, 5, 7, 9] and the values for num leaves lay within [3, 50].
From the results shown in figure 5.7, and remembering that LGB uses leaf-wise growth (section 4.4.1), it is noticeable that a greater number of leaves combined with a smaller tree depth has no influence, since a tree is constrained by its depth and vice versa. num leaves = 10 and max depth = 5 were the parameters kept for the study, because this combination gives the best performance in both AUC (91.44 ± 1.36) and AUPRC (56.78 ± 4.85) before the results begin to vary in a non-linear way, which may be an indication of over-fitting.
Figure 5.7: Number of leaves vs Max depth.
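The interaction between the two parameters follows from the fact that a binary tree of depth d has at most 2^d leaves, so any num_leaves value above that bound is inert. A tiny illustration (parameter names follow LightGBM):

```python
def effective_max_leaves(num_leaves: int, max_depth: int) -> int:
    """Number of leaves a depth-limited binary tree can actually grow."""
    return min(num_leaves, 2 ** max_depth)

# num_leaves=40 with max_depth=3 behaves like num_leaves=8:
# the depth constraint dominates, as seen in figure 5.7
shallow = effective_max_leaves(40, 3)   # 8
chosen = effective_max_leaves(10, 5)    # 10, since 10 < 2**5 = 32
```

This is why, in figure 5.7, the performance surface is flat wherever num_leaves exceeds 2^max_depth.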
It should be noted that, in both hyperparameter choices, the values chosen from the graphs are never the ones that achieve the maximum registered AUC and/or AUPRC; as a rule, they are the preceding values, which steer the models away from over-fitting.
Table 5.3 shows the results for the first-day analysis with LR and LGB using the hyperparameters selected above.
Table 5.3: Performance metrics for first-day dataset.
Method AUC [%] AUPRC [%] Sensitivity [%] Specificity [%]
LGB 91.44± 1.36 56.78± 4.85 83.97± 2.03 83.68± 1.91
LR 89.85± 1.55 50.69± 5.36 82.18± 1.93 82.35± 1.96
5.4.2 Effects of Sampling Techniques
Sampling tecnhiques were applied in order to counteract the imbalance present in the dataset. Oversampling
and undersampling were applied in both algorithms; while subsampling and feature sampling were only applied
to LGB since it is an ensemble method (see section 4.4.4).
Oversampling techniques
The results for each technique are presented in table 5.4. Picking random samples from the minority class performed better than creating new synthetic samples (AUC 91.12± 1.28, AUPRC 54.71± 5.28). Compared to the results in table 5.3, there is a reduction in all performance measures when oversampling techniques are used.
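Both oversampling strategies can be sketched in a few lines of NumPy. This is a didactic sketch, not imbalanced-learn's implementation: random oversampling duplicates minority samples, while the SMOTE-like function interpolates toward a randomly drawn minority neighbour.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X_min, n_new):
    """Duplicate randomly drawn minority samples."""
    idx = rng.integers(0, len(X_min), size=n_new)
    return X_min[idx]

def smote_like(X_min, n_new, k=5):
    """Create synthetic samples on segments between minority neighbours."""
    idx = rng.integers(0, len(X_min), size=n_new)
    synth = np.empty((n_new, X_min.shape[1]))
    for i, j in enumerate(idx):
        # k nearest minority neighbours of X_min[j] (excluding itself)
        d = np.linalg.norm(X_min - X_min[j], axis=1)
        nn = np.argsort(d)[1:k + 1]
        m = rng.choice(nn)
        gap = rng.random()                 # interpolation factor in [0, 1)
        synth[i] = X_min[j] + gap * (X_min[m] - X_min[j])
    return synth

X_min = rng.normal(size=(30, 4))           # minority class
extra = smote_like(X_min, n_new=70)        # balance against 100 majority samples
X_balanced_min = np.vstack([X_min, extra])
```

In practice the thesis-style pipeline would use imbalanced-learn's RandomOverSampler and SMOTE, which also handle labels and edge cases.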
Table 5.4: Performance metrics for oversampling techniques
Method AUC [%] AUPRC [%] Sensitivity [%] Specificity [%]
LGB
Random Oversampling 91.12± 1.28 54.71± 5.28 83.31± 2.13 83.39± 2.03
SMOTE 90.63± 1.38 53.18± 4.69 81.98± 2.04 82.03± 1.97
LR
Random Oversampling 89.93± 1.42 47.51± 5.17 82.11± 1.65 82.11± 1.57
SMOTE 89.60± 1.48 48.04± 5.33 82.03± 1.82 82.04± 1.81
Undersampling Techniques
Table 5.5 shows the results for each technique. It can be verified that the more instances of the majority class are removed, the worse the results obtained, with AUPRC being the performance measure most negatively affected. Among all techniques, Tomek links achieves the highest performance (AUC 91.40± 1.36, AUPRC 56.53± 4.90).
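Tomek links, the best-performing undersampling technique here, can be identified with a short NumPy sketch: two samples from opposite classes form a link when each is the other's nearest neighbour, and the majority member of each link is removed. This is a didactic sketch, not imbalanced-learn's implementation.

```python
import numpy as np

def tomek_undersample(X, y, majority=0):
    """Remove the majority member of every Tomek link."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)                  # nearest neighbour of each sample
    drop = set()
    for i, j in enumerate(nn):
        # mutual nearest neighbours of opposite classes -> Tomek link
        if nn[j] == i and y[i] != y[j]:
            drop.add(i if y[i] == majority else j)
    keep = np.array([i for i in range(len(y)) if i not in drop])
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1, (40, 2)), rng.normal(1.0, 1, (10, 2))])
y = np.array([0] * 40 + [1] * 10)
X_res, y_res = tomek_undersample(X, y)     # only majority samples are removed
```

Because only boundary samples of the majority class are deleted, Tomek links is far less destructive than random undersampling, which matches the results in table 5.5.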
Table 5.5: Performance metrics for undersampling techniques
Method AUC [%] AUPRC [%] Sensitivity [%] Specificity [%]
LGB
Random Undersampling 90.45± 1.27 50.92± 4.74 82.50± 2.27 82.56± 2.17
Tomek Links 91.40± 1.36 56.53± 4.90 83.62± 1.97 83.75± 2.00
ENN (n=1) 91.31± 1.37 55.51± 5.39 83.55± 2.04 83.63± 1.96
ENN (n=2) 91.25± 1.34 54.45± 5.23 83.76± 2.10 83.63± 2.12
ENN (n=3) 91.25± 1.35 54.09± 5.07 83.68± 2.09 83.56± 2.12
NCR (n=1) 91.33± 1.35 55.03± 5.15 83.40± 2.17 83.31± 2.03
NCR (n=2) 91.30± 1.34 55.75± 5.00 83.45± 2.12 83.46± 1.97
NCR (n=3) 91.25± 1.31 55.09± 4.97 83.56± 1.93 83.70± 1.89
LR
Random Undersampling 89.76± 1.38 46.77± 5.07 82.05± 1.59 81.91± 1.70
Tomek Links 89.86± 1.54 50.67± 5.37 82.19± 1.85 82.13± 1.80
ENN (n=1) 89.97± 1.51 50.23± 5.28 82.25± 1.73 82.28± 1.74
ENN (n=2) 90.03± 1.49 49.65± 5.40 82.35± 1.85 82.37± 1.81
ENN (n=3) 90.04± 1.45 49.09± 5.40 82.43± 1.74 82.33± 1.70
NCR (n=1) 89.94± 1.51 49.95± 5.37 82.20± 1.89 82.18± 1.77
NCR (n=2) 89.79± 1.55 50.20± 5.36 82.04± 1.95 82.01± 1.84
NCR (n=3) 89.72± 1.59 49.95± 5.34 81.73± 2.06 81.84± 1.95
Subsampling and Feature Sampling
Figure 5.8 summarizes the AUC and AUPRC results with LGB. When the majority of the features or samples are discarded during training, i.e. for values of feature fraction or bagging fraction below 0.5, the results are worse. Since this is a random process, there is a higher probability of discarding features important to the final prediction, and in the case of subsampling the number of samples can be greatly reduced. So there is not much advantage in using these sampling techniques with LGB besides making the process faster, which by itself has already been shown not to be needed.
Figure 5.8: Subsampling and feature sampling for LGB.
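What LightGBM's bagging_fraction and feature_fraction do at each boosting iteration can be mimicked directly. This is an illustrative sketch of the mechanism, not LightGBM internals:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_view(X, bagging_fraction=0.8, feature_fraction=0.8):
    """Random row and column subsets, redrawn at every boosting iteration."""
    n, p = X.shape
    rows = rng.choice(n, size=int(bagging_fraction * n), replace=False)
    cols = rng.choice(p, size=int(feature_fraction * p), replace=False)
    return X[np.ix_(rows, cols)], rows, cols

X = rng.normal(size=(1000, 187))           # 187 first-day features
Xs, rows, cols = sample_view(X, 0.8, 0.8)
# With fractions below 0.5, at any given iteration each informative
# column is more likely to be dropped than kept, which hurts performance.
```

The trade-off shown in figure 5.8 follows directly: smaller fractions mean faster trees but a higher chance that an informative column or region of the data is absent from a given iteration.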
5.4.3 Feature Selection
Recursive Feature Elimination (RFE) with SHAP Values for LGB Modeling
The mean absolute SHAP value after removing each feature is plotted in figure 5.9a, along with the respective mean values of AUC and AUPRC. Figures 5.9b,c,d show magnified views of the mean absolute SHAP value; likewise, magnified views of the mean AUC and AUPRC are shown in figures 5.9e,f and 5.9g,h, respectively.
Analyzing figures 5.9e,g, it is possible to conclude that there is almost no variation in AUC or AUPRC while features with null importance are eliminated. Nonetheless, both performance metrics vary slightly once higher-ranked features start to be eliminated: as more important features are removed, performance improves until it starts to decrease.
The AUC gradually increases until 136 features have been removed, while the AUPRC increases only until 96 features are removed. After these points, performance decreases slightly during the remaining elimination process. From figures 5.9f,h, there is a considerable drop in both metrics when the 20 highest-ranked features in the model are removed.
Table 5.7 shows the results when the feature sets with the highest AUC and AUPRC were selected. The performance is quite similar in both approaches, the AUC approach being preferable due to its smaller number of features. The performance of fixed feature subsets is also recorded for later comparison.
Figure 5.9: Recursive Feature Elimination - LGB with SHAP Values
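The elimination loop itself is simple: at every step, refit, rank the active features by an importance measure, and drop the weakest. A minimal sketch follows; the thesis uses mean |SHAP| as the importance for LGB and |weights| for LR, so here the importance function is a parameter, demonstrated with logistic-regression coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def recursive_elimination(X, y, importance, min_features=1):
    """Drop the least important feature at each step; return elimination order."""
    active = list(range(X.shape[1]))
    order = []
    while len(active) > min_features:
        imp = importance(X[:, active], y)   # one score per active feature
        worst = int(np.argmin(imp))
        order.append(active.pop(worst))
        # at this point the thesis re-evaluates AUC/AUPRC by cross-validation
    return order, active

def lr_weight_importance(X, y):
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return np.abs(model.coef_[0])

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
order, kept = recursive_elimination(X, y, lr_weight_importance, min_features=3)
```

Plugging in a function that returns mean |SHAP| per feature (e.g. from shap.TreeExplainer on the fitted LGB model) recovers the RFE-LGB-SHAP variant described above.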
Sequential Forward Selection (SFS) for LGB modeling
Mean AUC values obtained as features are added during feature selection are plotted in figure 5.10. Performance improves continuously until 20 features are reached; from there, the variation in performance is not significant. The performance of fixed subsets is recorded in table 5.7.
Figure 5.10: Sequential Forward Selection - LGB with AUC metric
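SFS is the mirror image of RFE: start from an empty set and greedily add the feature that most improves cross-validated AUC. A compact sketch is shown below; scikit-learn's SequentialFeatureSelector and mlxtend's implementation provide equivalent, more complete versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_features, cv=3):
    """Greedy forward selection maximizing mean CV AUC."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            auc = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, cols], y, cv=cv,
                                  scoring="roc_auc").mean()
            scores.append(auc)
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
chosen = forward_select(X, y, n_features=3)
```

Recording the best AUC at each step of the loop yields the curve of figure 5.10.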
Recursive Feature Selection (RFS) with SHAP Values for LGB Modeling
The mean absolute SHAP value for LGB after removing each feature is plotted in figure 5.11a, along with the respective mean values of AUC and AUPRC. Figures 5.11b,c,d show magnified views of the mean absolute SHAP value; likewise, magnified views of the mean AUC and AUPRC are shown in figures 5.11e,f and 5.11g,h, respectively.
Analyzing figures 5.11e,g, it is possible to conclude that both AUC and AUPRC increase gradually as features are eliminated, up to 141 and 142 eliminated features for each metric, respectively.
From figures 5.11f,h, there is a considerable drop in both metrics when the 20 highest-ranked features in the model are removed.
Table 5.7 shows the results when the feature sets with the highest AUC and AUPRC were selected; the performance is quite similar in both approaches. The performance of fixed feature subsets is also recorded for later comparison.
Figure 5.11: Recursive Feature Selection - LGB with SHAP Values
Recursive Feature Elimination (RFE) with weight vectors for LR modeling
Computing SHAP values for LR is computationally expensive. In this case, instead of eliminating at each step the feature with the lowest absolute SHAP value, the feature with the lowest absolute weight was removed.
Analogously to the previous processes, the mean absolute weight after removing each feature is plotted in figure 5.12a, along with the respective mean values of AUC and AUPRC. Figures 5.12b,c,d show magnified views of the mean absolute weight, and figures 5.12e,f and 5.12g,h show magnified views of the mean AUC and AUPRC, respectively.
Figures 5.12e,g show the evolution of AUC and AUPRC as the features with the least impact are eliminated. AUC peaks with a subset of 102 features, while the highest AUPRC occurs with a subset of 81 features.
Table 5.7 also presents the performance for both subsets. Inversely to LGB, the AUC approach results in a feature subset of higher dimensionality than the AUPRC approach. The performance of fixed subsets is also recorded for later comparison.
Figure 5.12: Recursive Feature Elimination - LR with Weight Vectors
Sequential Forward Selection (SFS) for LR modeling
Mean AUC values obtained as features are added during feature selection are plotted in figure 5.13. Performance improves continuously until 20 features are reached; from there, the variation in performance is not significant. The performance of fixed subsets is recorded in table 5.7.
Figure 5.13: Sequential Forward Selection - LR with AUC metric
Recursive Feature Selection (RFS) with Weight Vectors for LR Modeling
The mean absolute weight after removing each feature is plotted in figure 5.14a, along with the respective mean values of AUC and AUPRC. Figures 5.14b,c,d show magnified views of the mean absolute weight; likewise, magnified views of the mean AUC and AUPRC are shown in figures 5.14e,f and 5.14g,h, respectively.
Analyzing figures 5.14e,g, it is possible to observe a linear increase of AUC and AUPRC when eliminating the features with the least weight, up to 70 features removed.
The AUC reaches its maximum value with a subset of 61 features, and the AUPRC at 75 features. After these points, performance decreases slightly during the remaining selection process. From figures 5.14f,h, there is a considerable drop in both metrics when the highest-ranked features are removed.
Table 5.7 shows the results when the feature sets with the highest AUC and AUPRC were selected. The performance is quite similar in both approaches, the AUC approach being preferable due to its smaller number of features. The performance of fixed feature subsets is also recorded for later comparison.
5.4.4 Feature Selection - Comparison
The features with the highest importance after the feature selection processes, i.e. the features extracted in the last 20 iterations of RFE and RFS and the first 20 features added during SFS, are presented in table 5.6.
Among these highest-ranked features, those commonly shared between the different feature selection methods are highlighted in table 5.6, in blue for the LGB model and in green for the LR model. In the case of LGB, 9 features are shared among the feature selection processes, while for LR, 6 features are shared.
Moreover, within each model the majority of the selected features appear in more than one feature selection process, and there is also a high similarity between the features selected by the two models (LGB and LR).
Figure 5.14: Recursive Feature Selection - LR with Weight Vectors
Table 5.6: Input features after feature selection.
Position RFE-LGB-SHAP SFS-LGB RFS-LGB-SHAP RFE-LR-WEIGHT SFS-LR RFS-LR-WEIGHT
1st last gcs last gcs last gcs aniongap mean last gcs last gcs
2nd respiratory sys respiratory sys respiratory sys aniongap max bun max aniongap max
3rd aniongap mean aniongap med aniongap mean last gcs respiratory sys respiratory sys
4th glucose mean infectious sys platelet min resprate mean aniongap mean resprate mean
5th adm type EMERGENCY age age aniongap med resprate med age
6th ventilation time platelet min infectious sys num infusion rdw max rdw min
7th age sysbp mean sysbp med glucose mean age num infusion
8th resprate mean bun flag glucose mean bun min num infusion aniongap mean
9th urineoutput num infusion ventilation time age heartrate max wbc flag
10th num infusion wbc min num infusion weight first potassium flag glucose mean
11th rdw min spo2 mean resprate mean wbc min glucose mean ethnicity OTHER
12th bun mean glucose std urineoutput creatinine max user insulin bun min
13th infectious sys ethnicity OTHER bun mean creatinine flag ethnicity OTHER spo2 mean
14th bun min nervous sys ethnicity OTHER bicarbonate flag nervous sys creatinine max
15th rdw flag sodium flag rdw min wbc flag sysbp min creatinine flag
16th platelet min creatinine max nervous sys rdw mean creatinine max heartrate max
17th bun max urineoutput sodium min heartrate max wbc flag user insulin
18th diasbp min resprate mean diasbp min spO2 min ethnicity BLACK aniongap med
19th ethnicity OTHER diasbp med heartrate max rdw min adm type EMERGENCY nervous sys
20th spO2 mean user insulin mental sys creatinine med sodium var diasbp min
Since there is no consensus about which feature set is the best to pursue the study with, it is necessary to assess the performance associated with each feature selection process.
As previously noted, the results for all feature selection processes are summarized in table 5.7. The best performance comes from SFS with the LGB model and 30 features (AUC of 92.03 and AUPRC of 57.05). However, for smaller subsets (3, 5 and 7 features), which are the desired ones from the perspective of faster diagnoses and fewer variables to collect, RFS stands out for both the LGB and LR models.
For those reasons, the subsets from RFS were selected for external validation on the eICU-CRD database.
Table 5.7: Performance metrics after feature selection
Method AUC [%] AUPRC [%] Sensitivity [%] Specificity [%] No. Features
RFE-LGB-SHAP
AUC 91.57± 1.31 56.86± 4.88 83.61± 2.12 83.61± 2.12 51
AUPRC 91.54± 1.36 57.18± 4.72 83.73± 2.28 83.80± 2.30 91
Fixed Subset
90.30± 1.37 52.69± 4.85 81.86± 2.23 81.71± 2.17 14
88.80± 1.67 48.98± 4.80 80.97± 2.05 81.07± 2.01 7
87.36± 1.63 46.09± 4.47 79.15± 1.91 79.16± 1.95 5
86.86± 1.66 45.74± 4.54 79.26± 1.89 79.32± 1.89 3
SFS-LGB Fixed Subset
92.03± 1.16 57.05± 5.28 84.45± 1.74 84.45± 1.72 30
91.31± 1.28 54.15± 4.49 83.78± 1.82 83.75± 1.82 14
90.06± 1.32 51.63± 4.97 82.53± 1.78 82.56± 1.85 7
88.61± 1.44 47.78± 4.42 80.50± 2.01 80.55± 1.91 5
86.86± 1.65 45.59± 4.82 79.36± 1.72 79.22± 1.71 3
RFS-LGB-SHAP
AUC 91.83± 1.27 57.97± 5.00 83.88± 1.95 83.89± 1.97 45
AUPRC 91.83± 1.28 58.00± 4.92 83.96± 2.03 83.90± 1.98 44
Fixed Subset
90.99± 1.29 54.88± 5.00 83.34± 1.73 83.17± 1.79 14
90.17± 1.34 52.41± 4.60 82.30± 1.96 82.17± 1.85 7
88.78± 1.39 48.94± 4.54 81.33± 1.68 81.28± 1.72 5
86.85± 1.66 45.73± 4.54 79.26± 1.89 79.32± 1.89 3
RFE-LR-WEIGHT
AUC 90.09± 1.56 51.71± 5.31 82.47± 1.86 82.41± 1.86 102
AUPRC 89.94± 1.59 52.05± 5.40 82.08± 1.88 82.13± 1.93 81
Fixed Subset
88.32± 1.53 48.49± 5.31 80.10± 2.20 80.20± 2.30 14
86.41± 1.66 44.93± 4.99 78.45± 2.20 78.65± 2.48 7
85.29± 1.93 42.85± 4.84 77.82± 2.43 77.75± 2.57 5
83.59± 2.30 40.54± 5.24 75.08± 2.37 75.09± 2.38 3
SFS-LR Fixed Subset
90.12± 1.30 49.56± 5.19 82.65± 1.76 82.54± 1.68 30
89.35± 1.31 47.79± 5.15 81.28± 1.86 81.38± 1.81 14
87.97± 1.47 44.18± 5.23 80.24± 1.86 80.17± 1.84 7
87.01± 1.56 41.53± 4.98 79.21± 2.08 79.05± 1.90 5
85.41± 1.66 35.97± 5.07 75.26± 1.97 76.32± 1.94 3
RFS-LR-WEIGHT
AUC 90.28± 1.53 51.45± 5.44 82.37± 1.86 82.37± 2.01 61
AUPRC 90.19± 1.53 51.78± 5.43 82.28± 1.71 82.38± 1.73 75
Fixed Subset
89.07± 1.39 47.97± 5.25 81.38± 1.99 81.39± 2.03 14
88.16± 1.39 45.89± 5.33 80.65± 2.13 80.60± 2.22 7
87.14± 1.49 43.99± 5.09 79.30± 1.96 79.24± 1.94 5
85.65± 1.76 38.83± 4.94 76.68± 2.24 76.67± 2.29 3
5.4.5 Comparing Approaches
Table 5.8 gives a general overview of the best results obtained from all the different approaches. LGB is the chosen model, since it outperforms LR on all performance metrics in every approach. The original model after hyperparameter tuning and the models after feature selection stand out among these approaches.
The model resulting from feature selection using SFS with LGB is the best choice (AUC 92.03 ± 1.16, AUPRC 57.05± 5.28).
Table 5.8: Performance metrics to comparison between approaches for LGB
Approach Method AUC [%] AUPRC [%] Sensitivity [%] Specificity [%] No. Features
Original - 91.44± 1.36 56.78± 4.85 83.97± 2.03 83.68± 1.91 All
Oversampling Random Oversampling 91.12± 1.28 54.71± 5.28 83.31± 2.13 83.39± 2.03 All
Undersampling Tomek Links 91.40± 1.36 56.53± 4.90 83.62± 1.97 83.75± 2.00 All
Feature Selection SFS-LGB 92.03± 1.16 57.05± 5.28 84.45± 1.74 84.45± 1.72 30
5.5 Comparison with severity scores
The performance metrics of common severity scores were calculated using the same 5x10-fold cross-validation on the same patients as the constructed models. For comparison purposes, results are presented for models with the same minimum and maximum number of features across all severity scores. The features selected for each model are derived from the RFS method with LGB.
Table 5.9 shows that the proposed model clearly surpasses all the severity scores. This might be an indication of how useful these models could be for predictive purposes.
Table 5.9: Performance metrics to compare with common severity scores used in clinical setting.
Method AUC [%] AUPRC [%] Sensitivity [%] Specificity [%] No. Features
SeverityScores
LODS 73.18± 2.87 25.41± 3.74 69.43± 3.97 66.56± 4.56 12
SAPS 73.60± 2.33 26.92± 3.59 68.35± 2.27 66.49± 3.33 14
SAPS II 77.41± 2.44 29.16± 4.07 70.40± 2.85 70.46± 2.61 12
SOFA 68.49± 3.50 24.97± 3.98 64.80± 4.37 63.76± 4.16 10
QSOFA 54.84± 2.44 10.32± 0.74 27.74± 14.90 77.91± 13.21 3
Model LGB 90.99± 1.29 54.88± 5.00 83.34± 1.73 83.17± 1.79 14
86.85± 1.66 45.73± 4.54 79.26± 1.89 79.32± 1.89 3
5.6 Analysis for different time-windows
From the perspective of real-time mortality prediction, data was extracted for time-windows besides the first 24 hours, which was exhaustively analyzed in the previous sections. The hyperparameters used in this section were those previously selected in section 4.4.4, imputing all features. Recall that for the windows 12 and 24 hours prior to discharge, the length of stay in the ICU (los icu) is used as an extra feature.
Figure 5.15 and table 5.10 show the evolution of the performance metrics for the different time-windows. It would be expected that using patients' ICU discharge information, whether dead or alive, would yield higher predictive performance, since patients have begun to show signs of improvement or worsening of their clinical status. This is verified in all performance metrics, with AUPRC showing the most considerable improvement.
Figure 5.15: Analysis for different time-windows.
Table 5.10: Performance metrics for different data extraction time-windows.
Time-window AUC [%] AUPRC [%] Sensitivity [%] Specificity [%]
LGB
12h after admission 90.65± 1.63 48.41± 5.89 83.76± 2.33 83.81± 2.33
24h after admission 91.44± 1.36 56.78± 4.85 83.97± 2.03 83.68± 1.91
24h before discharge 92.80± 1.10 74.14± 2.90 85.24± 1.57 85.13± 1.55
12h before discharge 94.84± 0.92 78.76± 2.72 87.40± 1.57 87.44± 1.52
LR
12h after admission 89.32± 1.68 43.99± 5.55 81.40± 2.65 81.51± 2.63
24h after admission 89.85± 1.55 50.69± 5.36 82.18± 1.93 82.35± 1.96
24h before discharge 91.55± 1.03 69.48± 3.20 83.88± 1.59 83.98± 1.52
12h before discharge 93.97± 0.99 76.15± 2.75 86.97± 1.68 87.01± 1.61
5.7 Comparison with similar studies
A comparison of performance metrics between studies is presented in table 5.11. Not all the metrics used in this work are represented, due to the lack of information in the studies.
Compared to [57], where the cohort is restricted to diabetic patients, the model developed in this work achieved better results: using a small subset with just five variables (last gcs, respiratory sys, aniongap mean, platelet min and age), an AUC of 88.8 was achieved in this work compared to 78.7 in the study. It is worth remarking that 96.6% of the population in the mentioned study is under insulin therapy, which indicates the similarity between that study's cohort and the cohort of patients in this work.
Compared to the highest performance found in the literature (AUC of 92.7), a study [59] with no restrictions on patient selection, the model had a similar performance (AUC of 92.0) despite dealing with a less predictable cohort of patients, as can be seen by comparing the frequently used Simplified Acute Physiology Score II (SAPS II) associated with each study (SAPS II AUC of 77.4 in this work vs 80.9 in the study). The performance presented in this work was also achieved with a smaller feature set (30 features) than in the mentioned study. However, unlike the model proposed here, the study does not take diagnoses into consideration.
Table 5.11: Performance comparison with literature
Studies AUC [%] Sensitivity [%] Specificity [%] No. Patients No. Features Info
Johnson et al., 2017 [59] 92.7 - - 50488 144 No patient restrictions
Anand et al., 2018 [57] 78.7 70 73 4111 5 Diabetic patients
Models proposed 92.03± 1.16 84.45± 1.74 84.45± 1.72 9098 30 Patients under insulin therapy
88.78± 1.39 81.33± 1.68 81.28± 1.72 9098 5 Patients under insulin therapy
5.8 External Validation - eICU-CRD database
The dataset extracted from the eICU-CRD database is detailed in appendix A.
Models trained with all the data from the MIMIC database, using the feature subsets obtained in section 5.4.3, were validated on an external dataset from the eICU-CRD database [106]. These validation results are detailed in table 5.12.
The results are compared with the previous results from 5x10 cross-validation using only the MIMIC database, and also with the results from using the same 5x10 cross-validation for training but always using the eICU-CRD patients as the test fold.
There is a slight decrease in performance in external validation, around 2% in AUC and 4% in AUPRC, when comparing with the results using only the MIMIC database.
Focusing only on external validation and cross-validation using eICU-CRD as the test fold, a consistent increase in external validation is verified for all subsets tested and all metrics. This may be due to the higher number of patients used to train the model.
This result is indicative of how important it would be to have an even more representative dataset, and suggests that, in the future, integrating the eICU-CRD dataset into model training may result in an even more robust model with better predictive performance.
Overall, the model with the largest number of features (7) achieved the best performance in external validation (AUC of 87.99 and AUPRC of 47.09). However, models with fewer features also achieved interesting results.
Table 5.12: Performance metrics for external validation with eICU database
Method AUC [%] AUPRC [%] Sensitivity [%] Specificity [%] No. Features
LGB
External Validation (MIMIC+eICU)
87.99 47.09 81.17 81.26 7
86.79 45.61 79.91 79.92 5
84.11 41.83 77.68 77.76 3
5x10 Cross-Validation (MIMIC+eICU)
87.62± 0.28 46.08± 0.81 81.15± 0.46 81.14± 0.46 7
86.67± 0.31 45.23± 0.83 79.59± 0.47 79.59± 0.48 5
83.89± 0.56 41.42± 0.77 77.32± 1.04 77.31± 1.03 3
5x10 Cross-Validation (MIMIC)
90.17± 1.34 52.41± 4.60 82.30± 1.96 82.17± 1.85 7
88.78± 1.39 48.94± 4.54 81.33± 1.68 81.28± 1.72 5
86.85± 1.66 45.73± 4.54 79.26± 1.89 79.32± 1.89 3
Chapter 6
Model Analysis and Interpretation
This chapter presents an analysis and interpretation of the results of the proposed model. First, section 6.1 interprets which features are most relevant and how each feature influences the outcome. Then, a detailed analysis of the features with high importance and of the insulin-related features is conducted and, where possible, corroborated with medical studies. From another perspective, section 6.2 presents individual clinical dashboards showing which features influence each patient individually.
6.1 Model Interpretation
The first-day data, together with the cross-validation fold that achieved the best relation between AUC and AUPRC, was selected to interpret the model. All features were used, since the overall performance did not increase significantly during the feature selection process (section 5.4.3). This analysis, taking all features into consideration, does not invalidate the feature selection previously performed; it only provides insight into how the features influence the final outcome.
The SHAP values provide the importance of each feature. Figure 6.1 shows the 20 most important features for each model on the selected fold. Comparing both models, some features are commonly important for predicting the mortality of patients during their ICU stay. The last GCS value (last gcs) is the feature with the highest importance for both models. The patient's age (age) and how long a patient was under mechanical ventilation (ventilation time) are also among the features that most influenced the models' outputs. These features will be used for the following analyses.
For visualization purposes, and to know whether features affect the outcome negatively or positively, dots representing each patient are plotted horizontally by their SHAP value (figure 6.2) and coloured by their nominal feature value, from low (blue in LGB/green in LR) to high (red in LGB/orange in LR). As an example, glucose mean values can range between 40 mg/dL (blue or green), represented as Low, and 500 mg/dL (red or orange), represented as High.
Moreover, dots are stacked vertically when they run out of space, creating a density effect that makes it possible to visualize patients concentrated at a given SHAP value [82]. It is also noteworthy that negative SHAP values favor class 0 (Alive) and positive SHAP values favor class 1 (Dead).
(a) Feature importance for LGB model (b) Feature importance for LR model
Figure 6.1: Features' importance ranked through SHAP values.
(a) SHAP values for LGB (b) SHAP values for LR
Figure 6.2: SHAP values for the 20 most important features.
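For a linear model with independent features, the SHAP value of feature i for one patient is simply w_i(x_i - E[x_i]), so the additive decomposition read off figure 6.2 can be reproduced without the shap package. The sketch below covers only the LR case (tree models such as LGB require shap's TreeExplainer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

w, b = model.coef_[0], model.intercept_[0]
base_value = X.mean(axis=0) @ w + b        # expected model output (log-odds)
shap_values = w * (X - X.mean(axis=0))     # one value per patient and feature

# Per-patient SHAP values sum back to the model's log-odds output:
# margin_i = base_value + sum_j shap_values[i, j]
margins = X @ w + b
recon = base_value + shap_values.sum(axis=1)
```

Positive values push a patient's prediction toward class 1 (Dead) and negative values toward class 0 (Alive), matching the reading of figure 6.2.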
A characteristic of some features is the presence of long tails on just one side of the plots, especially in the LGB model due to its non-linearity. For example, the general trend of long tails reaching to the right, but not to the left, means that extreme values of these features can significantly increase mortality, while low values cannot significantly lower that risk. Long tails on the right side of the plots indicate a higher risk of mortality; conversely, a left-sided tail indicates a higher probability of survival.
To give some examples for better interpretation: low values of last gcs indicate an especially high risk of mortality, while high values have less influence on patients' survival; ventilation time contributes more to mortality as its value increases, but low values have almost no influence on the chances of survival; diseases of the body systems (respiratory sys and nervous sys) also influence mortality, but their absence has little effect; and the mean respiratory rate (resprate mean) influences the final outcome more as its nominal value increases.
Furthermore, features like age show a gradual influence of the nominal value: low values have a high impact on survival chances, but as the value increases the impact decreases, until a turning value where the risk of mortality begins to be favored.
The following subsections describe and analyze some of the most important features in greater detail.
6.1.1 Glasgow Coma Scale, Age and Ventilation Time
As verified above, the Glasgow Coma Scale (GCS), age, and mechanical ventilation duration are the three most important features for both LGB and LR according to the ranking list.
Figure 6.3 shows the patients' distribution by SHAP rank and the respective nominal values associated with each of those features. The different behaviour of each algorithm is notable: LR shows a linear distribution, compared to the non-linear distribution of LGB.
The covariate last gcs is a discrete feature with values between 3 and 15. A gradual increase favoring mortality (positive SHAP rank) is seen as the Glasgow Coma Scale decreases. For the LGB model, all feature values above 14 fall below a SHAP rank of 0; in the case of LR, they are spread along the SHAP rank scale.
In the case of age, mortality begins to be favored around 70 years old for LGB and between 55 and 75 years old for LR. This trade-off appears smoother in the LR case due to the already mentioned behaviour of each model. It should also be noted that, in LGB, there is an increased survival for patients below 51 years old and an increased risk of mortality for patients above 75 years old.
Lastly, ventilation time in LGB influences survival chances for small values (below 17 hours), but with quite a small impact. The longer a patient stays under ventilation, the higher the risk of mortality, predominantly for patients who spent almost the whole first day (more than 22 hours) under ventilation. For the LR model, values above 10 hours start to favor mortality, while values below favor survival chances.
It should be pointed out that ventilation time values above 24 hours represent patients who, a priori, entered the ICU already under ventilation.
In fact, this analysis could be carried out for all the features in the dataset (except those with null importance). Nonetheless, from now on the work will focus on key features that can potentially be related to the condition of patients who are under insulin therapy to control their blood glucose.
(a) Last GCS and SHAP values for LGB (b) Last GCS and SHAP values for LR
(c) Age and SHAP values for LGB (d) Age SHAP values for LR
(e) Ventilation time and SHAP values for LGB (f) Ventilation time and SHAP values for LR
Figure 6.3: SHAP ranking and the relationship to different covariates.
6.1.2 Number of Insulin Infusions
The number of insulin infusions (num infusion) is a feature with relative importance in both models, and a higher number of infusions is associated with a higher chance of survival, as can be evaluated in figures 6.1 and 6.2.
Figure 6.4a shows that the number of infusions starts to influence the chances of survival at a smaller value for LGB (8 infusions) than for LR (around 10 infusions, figure 6.4b). Indeed, in the LR model the sample distribution forms an orange long tail on the left side (with the same interpretation as in figure 6.2), which indicates that a higher number of infusions has a much stronger influence on a patient's chance of survival and, consequently, corresponds to a lower mortality.
This conclusion is crucial for the discussion between the CIT and IIT regimes. Since a higher number of infusions is associated with an increased chance of survival, the results of this work may favour IIT. Even so, an individualized study of each patient would be necessary to support such a conclusion, taking into consideration the condition of being diabetic or a long-term insulin user; validation by clinicians is also needed.
(a) Number of infusions and SHAP values for LGB model (b) Number of infusions and SHAP values for LR model
Figure 6.4: Number of infusions and SHAP values for LGB and LR models.
6.1.3 Diabetes and Long-Term Insulin Users
Regarding the influence of diabetes and long-term insulin use, figure 6.5 shows that having type I or secondary diabetes has null impact on mortality prediction in LGB. However, long-term insulin users and patients with type II diabetes have an increased chance of survival.
In LR, each condition has a positive impact on survival chances, while the impact of not having any condition is practically null for the majority of patients. It should be emphasized that being a long-term insulin user or a type I diabetic are among the highest-ranked features in LR (figure 6.1).
From an overall perspective, patients with any type of diabetes have an increased chance of survival. This may be due to the fact that they had to be exposed to insulin previously in order to survive. In fact, being a long-term insulin user has the highest influence in both models, which serves as the basis for the previous inference. However, these groups should be further refined and delimited.
Figure 6.5: Diabetic condition and SHAP values for the (a) LGB and (b) LR models.
6.1.4 Ethnicity
Regarding ethnicity, it is possible to conclude from figure 6.6 that black patients have a higher chance of survival under insulin therapy in the LGB model. This can be supported by previous studies [107, 108], which showed that people of African descent have an exacerbated insulin response and a lower insulin sensitivity compared to other ethnicities.
The same holds for the LR model. Additionally, Asian and Hispanic/Latino patients have some importance, albeit lower, for survival. Nonetheless, patients with undefined ethnicity (ethnicity OTHER) tend to have a higher mortality risk in both models; this is one of the limitations of the MIMIC-III database.
Figure 6.6: Ethnicity and SHAP values for the (a) LGB and (b) LR models.
6.1.5 Glucose
Among the discretized time-series features, glucose, respiratory rate and anion gap seem to play an important role in the models. Glucose had already been identified as a crucial feature because it is intrinsically related to many metabolic responses, and mainly to insulin administration (section 2.2.2). The built models corroborate this assumption, since the mean and minimum glucose readings were identified as influencing the outcome.
Figures 6.7a,b show that, for glucose mean, there is a trade-off between patient survival and mortality at values around 150 mg/dL in both models, above which values contribute to mortality. As values keep increasing, there is only a small risk escalation for LGB, whereas in the LR model the risk is directly proportional to the glucose mean values.
Glucose min values are represented in figures 6.7c,d. In the LR model, the trade-off point is positioned between [80 − 100] mg/dL, and risk is again directly proportional to increasing values. In the LGB model, however, there are some significant changes (figure 6.7c). Glucose min values below 40 mg/dL, representing severe hypoglycemia, level 3 (table 2.2), favour the patient's mortality. Values between [40 − 100] mg/dL mostly favour the chance of surviving, despite a few cases where mortality is favoured, although with less impact. Values above 100 mg/dL favour the risk of mortality, which becomes accentuated in some cases as values rise.
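The bands read off figure 6.7c for the LGB model can be summarised as a simple rule. The boundaries below are the approximate values quoted in the text, not exact model thresholds:

```python
def glucose_min_effect(value_mg_dl):
    """Approximate effect of the minimum glucose reading on the LGB
    prediction, as read from the SHAP dependence plot (figure 6.7c).
    Boundaries are the approximate values quoted in the text."""
    if value_mg_dl < 40:          # severe hypoglycemia, level 3 (table 2.2)
        return "favours mortality"
    if value_mg_dl <= 100:
        return "favours survival"
    return "favours mortality"
```

Such a rule is only a reading aid for the dependence plot; the model itself learns a continuous, non-monotonic response rather than hard cut-offs.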
Figure 6.7: Glucose readings and SHAP values: (a, b) mean and (c, d) minimum values, for the LGB and LR models respectively.
6.1.6 Respiratory Rate and Respiratory Diseases
Respiratory rate mean values (resprate mean) and median values (resprate med) are also features of importance for the models.
Analyzing figure 6.8, there are no significant differences between mean and median values in the LR model. The slope and the associated SHAP values are very similar for both measurements, and the mortality/survival trade-off occurs at values between [15 − 20] breaths per minute, which coincides with the typical range in adults (table 3.5). Smaller values increase survival chances, while higher values increase the risk of mortality.
Mean and median are normally highly correlated measurements, which may lead to the importance of a single underlying feature being distributed between both, as seems to happen in the LR model. Nevertheless, significant differences in specific cases can make the difference in the model's outcome when both features are used, each with a smaller share of the importance, instead of just one carrying the totality of it.
This conclusion can be drawn by looking at figure 6.8 for the LGB model. In this case, values between [8 − 17] for both measures favour survival chances, although with small impact. resprate mean values between [17 − 22] tend to have almost null impact, and as values increase beyond that range, the higher the mortality risk becomes, especially above 28. The same is not verified for resprate med: besides a slight increase in mortality risk for values above 25 compared to the trade-off values, the risk tends to remain constant as values increase.
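The claim that mean and median respiratory rates split the importance of one underlying signal can be checked by measuring their correlation across patients. A plain-Python Pearson correlation is enough; the per-patient sample values below are invented for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented per-patient (resprate_mean, resprate_med) pairs.
mean_rr = [14.2, 18.5, 22.1, 25.0, 16.3]
med_rr = [14.0, 18.0, 21.5, 26.0, 16.0]
r = pearson(mean_rr, med_rr)   # close to 1 for near-duplicate signals
```

A correlation near 1 supports the interpretation above: when two near-duplicate features are both offered to the model, their shared signal is split between them, deflating each one's individual ranking.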
Figure 6.8: Respiratory rate and SHAP values: (a, b) mean and (c, d) median values, for the LGB and LR models respectively.
Given that diseases of the respiratory system (respiratory sys) is also a feature with high importance for the model, it is interesting to examine how the respiratory rate features relate to it. That relation is visible in figure 6.9 for the LGB model, where data points are coloured by the number of respiratory diseases. There is a clear propensity: the more respiratory diseases identified in a patient, the higher the values of resprate mean and resprate med, which in turn represent a higher mortality risk.
The importance of these respiratory-related features in the study may be due, among other complications, to the patients' resistance to insulin. Some medical studies [109–111] connect insulin resistance with lung disorders, which may be a cause for this outcome.
Figure 6.9: (a) Mean and (b) median respiratory rate and SHAP values for the LGB model, as in figures 6.8a,c, with data points coloured by the number of respiratory diseases.
6.1.7 Anion Gap and Bicarbonate
Finally, mean values of anion gap (aniongap mean) follow an evolution similar to age or glucose mean. Chances of survival decrease from the lowest value up to a trade-off point (17 for LGB and between [12 − 18] for LR); beyond it, the higher the aniongap mean values, the higher the risk of mortality. These deductions can be extracted from figure 6.10.
Figure 6.10: Mean anion gap and SHAP values for the (a) LGB and (b) LR models.
A study [112] concluded that higher values of anion gap and lower values of bicarbonate are associated with insulin resistance. Although insulin resistance is not directly associated with mortality, it is worth investigating a possible meaningful association between them. To that end, checking the behaviour of bicarbonate values in the models helps to substantiate the assumption.
Bicarbonate is not an important feature in the LR model. However, in the LGB model, mean values of bicarbonate (bicarbonate mean) appear as an important feature (figures 6.1 and 6.2). Figure 6.11 corroborates the previous assumption: there is a turning point (23 mEq/L) from which chances of survival start to increase, so lower values increase the mortality risk while higher values benefit the chances of survival.
Figure 6.11: Mean bicarbonate and SHAP values for the LGB model.
6.2 Individualized Clinical Dashboards
General patterns and conclusions were drawn from the analysis of the whole cohort, but knowing how each feature influences each individual patient might facilitate a personalized diagnosis.
Creating a model capable of responding to a patient's particular necessities is of high importance because each individual case is different: what might be relevant for a specific patient can have null or poor impact on another. Therefore, an individualized medical record for each patient was created from the LGB model, using the best-performing fold from the first-day analysis.
First of all, it is necessary to analyze the model's outputs. As mentioned above (section 4.7), a threshold-based choice categorizes which patients are expected to die or survive from the model's output probabilities. In figure 6.12, patients above the threshold (t = 0.116) are those predicted to die, while those below are predicted to survive. Nonetheless, it is necessary to understand how far these patients' probabilities are from the threshold in order not to draw inaccurate conclusions.
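The threshold-based decision and the distance from the threshold can be sketched as follows; t = 0.116 is the threshold reported for this fold, while the probabilities passed in are illustrative:

```python
THRESHOLD = 0.116  # mortality-probability threshold reported for this fold

def classify(prob, t=THRESHOLD):
    """Patients whose predicted mortality probability exceeds the
    threshold are classified as expected to die."""
    return "death" if prob > t else "survival"

def margin(prob, t=THRESHOLD):
    """Signed distance to the threshold: values near zero mean the
    prediction is borderline and should be read with extra caution."""
    return prob - t

label = classify(0.42)   # a confident mortality prediction
m = margin(0.120)        # a borderline prediction, barely above t
```

Reporting the margin alongside the label is what makes it possible to flag the near-threshold patients discussed below.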
To explain this further, two patients, highlighted in figure 6.12, will be used as examples: one represents a mortality prediction (red dot) and the other a survival prediction (blue dot). They will also serve as the basis to explain the concept of individual clinical dashboards.
Figure 6.12: Patients' mortality probabilities.
For this purpose, the 20 features that most affect each patient's outcome were ranked and ordered in a bar plot. For visualization simplicity, features favouring chances of survival were coloured green, while those favouring mortality risk were coloured red (see figures 6.13 and 6.14).
Along with the features, their associated values were plotted in an adjacent graphic, represented depending on their type. Continuous features with a normal range of values are plotted in accordance with that range, so that it is immediately observable whether the value lies below, inside or above it. Binary features appear on a dashed line where the respective value (0 or 1) is highlighted. For the remaining features, the values are simply presented.
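The ranking step described above, keep the 20 strongest per-patient SHAP contributions and colour them by the direction they push the prediction, can be sketched as follows. The feature dictionary is a made-up example, not output from the thesis models:

```python
def dashboard_rows(patient_shap, top=20):
    """patient_shap: {feature_name: shap_value} for ONE patient.
    Positive SHAP pushes the prediction towards mortality (red),
    negative towards survival (green).  Rows are ordered by the
    magnitude of the contribution."""
    ranked = sorted(patient_shap.items(),
                    key=lambda kv: abs(kv[1]), reverse=True)[:top]
    return [(name, value, "red" if value > 0 else "green")
            for name, value in ranked]

# Made-up contributions for a single hypothetical patient.
example = {"creatinine_flag": 0.9, "age": -0.4, "sysbp_mean": 0.2}
rows = dashboard_rows(example, top=2)
```

Because the ranking is recomputed from each patient's own SHAP vector, two patients can show entirely different top features, which is exactly the individualization the dashboards aim for.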
Figure 6.13 presents the clinical dashboard for a patient expected to survive. Comparing this figure with figure 6.1, where features are ranked over the whole patients' cohort, it is perceptible that most features appear in both, however ranked with different importance, which immediately indicates an individualization of the diagnosis.
Figure 6.13: Clinical dashboard for a patient expected to survive
In a general and quick diagnosis for this patient, the number of abnormal creatinine measurements registered (creatinine flag), the mean systolic blood pressure value (sysbp mean) and the minimum blood urea nitrogen value (bun min) registered at the moment, along with the patient's age (age), are the variables that should most concern physicians when treating the patient.
Conversely, figure 6.14 shows a medical record for a patient expected to die. The number of features indicating mortality risk evidences the critical health status of the patient, which should lead to redoubled care focused on the variables that present greater risk.
Figure 6.14: Clinical dashboard for a patient expected to die
Revealing the real outcomes for the patients in figure 6.12, by colouring the dots in figure 6.15, it is possible to conclude that the above predictions were correct. It is also worth noting that the majority of patients who actually died (red dots) were correctly classified and lie above the threshold. Considering that the testing fold comprises 911 patients, of whom just 83 actually died, and that among those only 11 were misclassified, 4 of them quite near the threshold, these results well mirror the predictive capacity of the model. However, it cannot be excluded that those misclassified patients were not under the extremely poor health conditions that would have influenced the outcome, an observation reinforced by the high mortality probabilities attributed to 106 patients who actually survived.
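From the counts given above (911 patients, 83 deaths, 11 misclassified deaths and 106 survivors flagged as high risk), the fold's sensitivity and specificity can be recomputed; this sketch only rearranges the numbers quoted in the text:

```python
def sens_spec(n_total, n_deaths, false_neg, false_pos):
    """Derive sensitivity and specificity from raw counts."""
    tp = n_deaths - false_neg                # deaths correctly predicted
    tn = (n_total - n_deaths) - false_pos    # survivors correctly predicted
    sensitivity = tp / n_deaths
    specificity = tn / (n_total - n_deaths)
    return sensitivity, specificity

# Counts quoted for the best-performing first-day fold.
sens, spec = sens_spec(911, 83, 11, 106)
# roughly 0.87 sensitivity and 0.87 specificity
```

These per-fold values are consistent with the cross-validated sensitivity and specificity reported for the model in chapter 7.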
Figure 6.15: Patients' mortality probabilities coloured by real outcomes.
In conclusion, these mortality probabilities, coupled with individualized clinical dashboards, allow physicians a faster diagnosis and a general overview of the patient's health status. Together with clinical knowledge, these records can support physicians' decisions, leading to faster treatment that can eventually make a difference in patients' real outcomes.
Chapter 7
Conclusions
The main purpose of this thesis was to predict mortality in patients admitted to an ICU who were under insulin therapy. In a work mainly focused on the first 24 hours spent in the care unit, machine learning, data sampling and feature selection techniques were tested and compared in order to achieve an optimal predictive performance. The work was extended to different time-windows within the ICU stay, in a perspective of real-time mortality prediction. The variables that most affect patients' outcome were identified and interpreted with recourse to medical studies. From a medical decision support point of view, individualized clinical dashboards were designed from the models created.
7.1 Achievements
The work developed demonstrated that the gradient boosting (LGB) model substantially improved over all the other models tested, presenting the following performance metrics: AUC of 91.44±1.36, AUPRC of 56.78±4.85, sensitivity of 83.97± 2.03 and specificity of 83.68± 1.91.
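AUC, the headline metric throughout this work, can be written directly from its probabilistic definition: the chance that a randomly chosen positive is scored above a randomly chosen negative. A minimal reference implementation with toy scores (illustrative data, not thesis results):

```python
def auc(labels, scores):
    """AUC via its probabilistic definition: the proportion of
    (positive, negative) pairs in which the positive is ranked
    higher, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # 0.75
```

This rank-based formulation is why AUC is threshold-free: it depends only on the ordering of the predicted probabilities, not on the t = 0.116 cut-off used for classification.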
The data sampling techniques used to counteract the imbalance present in the dataset proved counterproductive, as performance decreased both with under- and oversampling.
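For context, the undersampling side of that comparison can be sketched as follows; this is a generic random undersampler, not the exact routines used in the thesis:

```python
import random

def random_undersample(samples, labels, seed=0):
    """Balance a dataset by randomly discarding majority-class samples
    until every class has the size of the minority class."""
    rng = random.Random(seed)
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    n_min = min(len(v) for v in by_class.values())
    out = []
    for l, v in by_class.items():
        out.extend((s, l) for s in rng.sample(v, n_min))
    return out

# 8 survivors vs 2 deaths -> 2 samples of each class remain.
data = random_undersample(list(range(10)), [0] * 8 + [1] * 2)
```

The sketch also illustrates why undersampling can hurt: in this toy case, six of the eight majority-class samples, and whatever signal they carried, are simply thrown away.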
Regarding feature selection, a subset of 30 features resulting from SFS achieved the best performance, with an AUC of 92.03 ± 1.16, an AUPRC of 57.05 ± 5.28, a sensitivity of 84.45 ± 1.74 and a specificity of 84.45 ± 1.72. Notwithstanding, considerably smaller subsets achieved quite interesting results, although slightly lower than those mentioned above.
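Sequential forward selection (SFS), which produced the best 30-feature subset, follows a simple greedy loop: at each step, add the candidate feature whose inclusion maximizes the validation score. A schematic version with a stand-in score function; in a real run, each candidate subset would be scored with cross-validated AUC:

```python
def sfs(all_features, score, k):
    """Greedy sequential forward selection: grow the subset one
    feature at a time, always adding the candidate that yields the
    best-scoring subset."""
    selected = []
    while len(selected) < k:
        best = max((f for f in all_features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Stand-in score: each feature contributes a fixed, known gain.
gains = {"last_gcs": 3.0, "age": 2.0, "aniongap_mean": 1.0}
subset = sfs(list(gains), lambda fs: sum(gains[f] for f in fs), 2)
```

The greedy loop never revisits earlier choices, which is what distinguishes SFS from backward elimination schemes such as RFE.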
From all the feature selection techniques tested, namely RFE, SFS and RFS, the novel technique proposed in this work (RFS) achieved the best results for the smaller subsets of features (3, 5 and 7), which were used for external validation with data from the eICU-CRD database. The best validation performance achieved an AUC of 87.99, an AUPRC of 47.09, a sensitivity of 81.17 and a specificity of 81.26, with a subset of 7 features. Achieving this performance with a reduced number of features, on a completely different dataset that is itself composed of different sources of information (208 hospitals), presents an extra argument in favour of the model constructed.
In a real-time prediction perspective, when using patients' ICU discharge information, whether dead or alive, the predictive performance was higher, with an AUC of 94.84 ± 0.92, an AUPRC of 78.76 ± 2.72, a sensitivity of 87.40±1.57 and a specificity of 87.44±1.52 at 12 hours prior to ICU discharge. This is possible because patients had begun to show signs of improvement or worsening of their clinical status.
The variables that most influence the models and the respective mortality predictions were also identified. However, as explained and analyzed during model interpretation, each variable's importance relates to each patient independently, and overall importance may differ from individual importance.
Finally, this work presents the construction of individualized clinical dashboards, which may be an important tool in a perspective of data-aided decisions by physicians.
7.2 Comparison with Previous Works
Compared to [57], where the cohort is restricted to diabetic patients, the model developed in this work achieved better results. Using a small subset with just five variables (last gcs, respiratory sys, aniongap mean, platelet min and age), an AUC of 88.8 was achieved in this work, compared to 78.7 achieved in that study. Of note, 96.6% of the population in the mentioned study is under insulin therapy, which indicates a good basis for comparison.
Following the approach proposed by [58] of using statistical features for time-series variables, the mean stood out among all statistical features used, followed by the minimum and maximum. The variance proved useless for the model, presenting almost null importance. Due to major differences in the patients' cohorts and the variables used, the performances of the two studies are not comparable.
Compared to the highest performance found in the literature (AUC of 92.7), from a study [59] with no restrictions on patient choice, the model has a similar performance (AUC of 92.0) despite dealing with a less predictable cohort of patients, as can be shown by comparing the frequently used Simplified Acute Physiology Score II (SAPS II) associated with each study (SAPS II AUC of 77.4 vs 80.9). The performance presented in this work was also achieved with a smaller subset (30 features) than in the mentioned study. However, unlike the model proposed here, that study does not take diagnoses into consideration.
7.3 Future Work
The work developed has room to evolve in a multitude of ways.
Firstly, in the context of continuing the insulin-related work, it may focus on studying the influence of different-acting types of insulin (i.e. short-, intermediate- and long-acting) and on considering other common inputs for these patients (e.g. dextrose boluses and glycated hemoglobin).
A more ambitious task would be to predict the insulin type and amount needed according to each patient's necessities, predicting its influence not only on glucose variability but also on the remaining variables, both those identified as having an important role in the course of this thesis and those that may yet prove to be important.
Lastly, it is important to highlight that further validation of these models and of the conclusions drawn is needed on the part of physicians, to improve and validate their interpretability.
Still within the medical field, the work can be extended to the entire patients' cohort with fewer restrictions, or directed to other areas of medical interest in order to reach different conclusions.
From another perspective, the models' construction process, from data treatment to hyperparameter tuning and model interpretation, can be reused in distinct studies in the most varied areas, provided data is available.
Bibliography
[1] P. Domingos. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake
Our World. Basic Books, Inc., 2018.
[2] I. A. Berg, O. E. Khorev, A. I. Matvevnina, and A. V. Prisjazhnyj. Machine learning in smart home
control systems - algorithms and new opportunities. AIP Conference Proceedings, 1906(1):070007,
2017.
[3] The Economist. The world's most valuable resource is no longer oil, but data, 2017. URL www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data.
[4] E. Ahmed, I. Yaqoob, I. A. T. Hashem, I. Khan, A. I. A. Ahmed, M. Imran, and A. V. Vasilakos. The
role of big data analytics in Internet of Things. Computer Networks, 129:459–471, 2017.
[5] M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth. Machine
learning for internet of things data analysis: a survey. Digital Communications and Networks, 4(3):
161–175, 2018.
[6] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55
(10):78, 2012.
[7] J. Rowley. The wisdom hierarchy: Representations of the DIKW hierarchy. Journal of Information
Science, 33(2):163–180, 2007.
[8] N. Jothi, N. A. Rashid, and W. Husain. Data Mining in Healthcare - A Review. Procedia Computer
Science, 72:306–313, 2015.
[9] E. M. Beulah, S. N. S. Rajini, and N. Rajkumar. Application of Data mining in healthcare: A survey.
Asian Journal of Microbiology, Biotechnology and Environmental Sciences, 18(4):1001–1003, 2016.
[10] F. Guiza, J. Van Eyck, and G. Meyfroidt. Predictive data mining on monitoring data from the intensive
care unit. Journal of Clinical Monitoring and Computing, 27(4):449–453, 2013.
[11] T. J. Pollard and L. A. Celi. Enabling Machine Learning in Critical Care. ICU management & practice,
17(3):198–199, 2017.
[12] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Data mining and knowledge discovery in databases. Communications of the American Association for Artificial Intelligence, 17(3):82–88, 1996.
[13] R. Wirth and J. Hipp. CRISP-DM: Towards a standard process model for data mining. Proceedings
of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data
Mining, (24959):29–39, 2000.
[14] A. G. Pittas, R. D. Siegel, and J. Lau. Insulin Therapy for Critically Ill Hospitalized Patients: A Meta-
analysis of Randomized Controlled Trials. Archives of Internal Medicine, 164(18):2005–2011, 10 2004.
[15] J. Clain. Glucose control in critical care. World Journal of Diabetes, 6(9):1082, 2015.
[16] M. E. McDonnell and G. E. Umpierrez. Insulin Therapy for the Management of Hyperglycemia in
Hospitalized Patients. Endocrinol Metab Clin North Am, 41(1):175–201, 2012.
[17] J. C. Preiser, J. G. Chase, R. Hovorka, J. I. Joseph, J. S. Krinsley, C. De Block, T. Desaive, L. Foubert,
P. Kalfon, U. Pielmeier, T. Van Herpe, and J. Wernerman. Glucose Control in the ICU: A Continuing
Story. Journal of Diabetes Science and Technology, 10(6):1372–1381, 2016.
[18] M. Haluzik, M. Mraz, P. Kopecky, M. Lips, and S. Svacina. Glucose control in the ICU: Is there a time
for more ambitious targets again? Journal of Diabetes Science and Technology, 8(4):652–657, 2014.
[19] F. G. Smith, A. M. Sheehy, J. L. Vincent, and D. B. Coursin. Critical illness-induced dysglycaemia:
Diabetes and beyond. Critical Care, 14(6):4–6, 2010.
[20] J. M. Boutin and L. Gauthier. Insulin infusion therapy in critically ill patients. Canadian Journal of
Diabetes, 38(2):144–150, 2014.
[21] C. De Block, B. Manuel-Y-Keenoy, L. Van Gaal, and P. Rogiers. Intensive insulin therapy in the intensive
care unit: Assessment by continuous glucose monitoring. Diabetes Care, 29(8):1750–1756, 2006.
[22] P. E. Cryer. Hypoglycaemia: The limiting factor in the glycaemic management of Type I and Type II
diabetes. Diabetologia, 45(7):937–948, 2002.
[23] J.-C. Lacherade, S. Jacqueminet, and J.-C. Preiser. An overview of hypoglycemia in the critically ill.
Journal of Diabetes Science and Technology, 3(6):1242–1249, 2009.
[24] A. E. Johnson, D. J. Stone, L. A. Celi, and T. J. Pollard. The MIMIC Code Repository: Enabling
reproducibility in critical care research. Journal of the American Medical Informatics Association, 25(1):
32–39, 2018.
[25] Agency for Healthcare Research and Quality. HCUP CCS. Healthcare Cost and Utilization Project
(HCUP), 2017. URL www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp.
[26] UAGC. Homeostasis and immunity - overview. URL http://anatomysciences.com/wp-content/
uploads/2018/03/human-body-systems-human-body-systems-photos-anatomy-human.jpg.
[27] J. B. Reece and N. A. Campbell. Campbell Biology. 9th ed. edition, 2012.
[28] J. Torday. Homeostasis as the Mechanism of Evolution. Biology, 4(3):573–590, 2015.
[29] C. Uluseker, G. Simoni, L. Marchetti, M. Dauriz, A. Matone, and C. Priami. A closed-loop multi-level
model of glucose homeostasis. PLoS ONE, 13(2):1–23, 2018.
[30] Diabetes - The global diabetes community. Diabetes types, 2017. URL https://www.diabetes.co.
uk/diabetes-types.html.
[31] The Nobel Prize. The nobel prize in physiology or medicine, 1923. URL www.nobelprize.org/prizes/
medicine/1923/summary/.
[32] Diabetes Education Online. Types of insulin, 2018. URL dtc.ucsf.edu/types-of-diabetes/
type2/treatment-of-type-2-diabetes/medications-and-therapies/type-2-insulin-rx/
types-of-insulin/.
[33] J. C. Marshall, L. Bosco, N. K. Adhikari, B. Connolly, J. V. Diaz, T. Dorman, R. A. Fowler, G. Meyfroidt,
S. Nakagawa, P. Pelosi, J. L. Vincent, K. Vollman, and J. Zimmerman. What is an intensive care unit?
A report of the task force of the World Federation of Societies of Intensive and Critical Care Medicine.
Journal of Critical Care, 37:270–276, 2017.
[34] E. S. Moghissi, M. T. Korytkowski, M. DiNardo, D. Einhorn, R. Hellman, I. B. Hirsch, S. E. Inzucchi,
F. Ismail-Beigi, M. S. Kirkman, and G. E. Umpierrez. American Association of Clinical Endocrinologists
and American Diabetes Association consensus statement on inpatient glycemic control. Diabetes Care,
32(6):1119–1131, 2009.
[35] American Diabetes Association. Standards of Medical Care in Diabetes. Diabetes Care, 41(Supplement
1):S1–S2, 2018.
[36] P. E. Marik and R. Bellomo. Stress hyperglycemia: an essential survival response! Critical Care, 17(2):
305, 2013.
[37] J. Jacobi, N. Bircher, J. Krinsley, M. Agus, S. S. Braithwaite, C. Deutschman, A. X. Freire, D. Geehan,
B. Kohl, S. A. Nasraway, M. Rigby, K. Sands, L. Schallom, B. Taylor, G. Umpierrez, J. Mazuski, and
H. Schunemann. Guidelines for the use of an insulin infusion for the management of hyperglycemia in
critically ill patients. Critical Care Medicine, 40(12):3251–3276, 2012.
[38] S. R. Heller. Glucose Concentrations of Less Than 3.0 mmol/L (54 mg/dL) Should Be Reported in
Clinical Trials: A Joint Position Statement of the American Diabetes Association and the European
Association for the Study of Diabetes. Diabetes Care, 40(1):155–157, 2017.
[39] K. Malmberg, L. Ryden, S. Efendic, J. Herlitz, P. Nicol, A. Waldenstrom, H. Wedel, and L. Welin.
Randomized trial of insulin-glucose infusion followed by subcutaneous insulin treatment in diabetic
patients with acute myocardial infarction (DIGAMI study): Effects on mortality at 1 year. Journal of
the American College of Cardiology, 26(1):57–65, 1995.
[40] G. Van den Berghe, P. Wouters, F. Weekers, C. Verwaest, F. Bruyninckx, M. Schetz, D. Vlasselaers,
P. Ferdinande, P. Lauwers, and R. Bouillon. Intensive Insulin Therapy in Critically Ill Patients. New
England Journal of Medicine, 345(19):1359–1367, nov 2001.
[41] G. Van den Berghe, A. Wilmer, G. Hermans, W. Meersseman, P. J. Wouters, I. Milants, E. Van
Wijngaerden, H. Bobbaers, and R. Bouillon. Intensive Insulin Therapy in the Medical ICU. New England
Journal of Medicine, 354(5):449–461, feb 2006.
[42] W. M. Clark, W. Brooks, A. Mackey, M. D. Hill, P. P. Leimgruber, A. J. Sheffet, D. Ph, V. J. Howard,
D. Ph, W. S. Moore, J. H. Voeks, D. Ph, L. N. Hopkins, D. E. Cutlip, D. J. Cohen, J. J. Popma,
R. D. Ferguson, S. N. Cohen, J. L. Blackshear, F. L. Silver, J. P. Mohr, B. K. Lal, J. F. Meschia, and
C. Investigators. Intensive versus Conventional Glucose Control in Critically Ill Patients. New England
Journal of Medicine, 360(13):11–23, 2016.
[43] J. C. Preiser, P. Devos, S. Ruiz-Santana, C. Melot, D. Annane, J. Groeneveld, G. Iapichino, X. Leverve,
G. Nitenberg, P. Singer, J. Wernerman, M. Joannidis, A. Stecher, and R. Chiolero. A prospective
randomised multi-centre controlled trial on tight glucose control by intensive insulin therapy in adult
intensive care units: The Glucontrol study. Intensive Care Medicine, 35(10):1738–1748, 2009.
[44] P. Kalfon, B. Giraudeau, C. Ichai, A. Guerrini, N. Brechot, R. Cinotti, P. F. Dequin, B. Riu-Poulenc,
P. Montravers, D. Annane, H. Dupont, M. Sorine, and B. Riou. Tight computerized versus conventional
glucose control in the ICU: A randomized controlled trial. Intensive Care Medicine, 40(2):171–181, 2014.
[45] G. Y. Gandhi, G. A. Nuttall, M. D. Abel, C. J. Mullany, H. V. Schaff, P. C. O’Brien, M. G. Johnson,
A. R. Williams, S. M. Cutshall, L. M. Mundy, R. A. Rizza, and M. M. McMahon. Intensive Intraoperative
Insulin Therapy versus Conventional Glucose Management during Cardiac Surgery. Annals of Internal
Medicine Article, 146(4):233–243, 2007.
[46] Y. M. Arabi, O. C. Dabbagh, H. M. Tamim, A. A. Al-Shimemeri, Z. A. Memish, S. H. Haddad, S. J.
Syed, H. R. Giridhar, A. H. Rishu, M. O. Al-Daker, S. H. Kahoul, R. J. Britts, and M. H. Sakkijha.
Intensive versus conventional insulin therapy: A randomized controlled trial in medical and surgical
critically ill patients. Critical Care Medicine, 36(12):3190–3197, 2008.
[47] G. Del Carmen De La Rosa, J. H. Donado, A. H. Restrepo, A. M. Quintero, L. G. Gonzalez, N. E.
Saldarriaga, M. Bedoya, J. M. Toro, J. B. Velasquez, J. C. Valencia, C. M. Arango, P. H. Aleman,
E. M. Vasquez, J. C. Chavarriaga, A. Yepes, W. Pulido, and C. A. Cadavid. Strict glycaemic control
in patients hospitalised in a mixed medical and surgical intensive care unit: A randomised clinical trial.
Critical Care, 12(5):1–9, 2008.
[48] F. M. Brunkhorst, C. Engel, F. Bloos, A. Meier-Hellmann, M. Ragaller, N. Weiler, O. Moerer, M. Gru-
endling, M. Oppert, S. Grond, D. Olthoff, U. Jaschinski, S. John, R. Rossaint, T. Welte, M. Schaefer,
P. Kern, E. Kuhnt, M. Kiehntopf, C. Hartog, C. Natanson, M. Loeffler, and K. Reinhart. Intensive
Insulin Therapy and Pentastarch Resuscitation in Severe Sepsis. New England Journal of Medicine, 358
(2):125–139, 2008.
[49] I. M. Mackenzie and A. Ercole. Glycaemic control and outcome in general intensive care : the East
Anglian GLYCOGENIC study. British Journal of Intensive Care, (December):121–126, 2008.
[50] M. Yang, Q. Guo, X. Zhang, S. Sun, Y. Wang, L. Zhao, E. Hu, and C. Li. Intensive insulin therapy on
infection rate, days in NICU, in-hospital mortality and neurological outcome in severe traumatic brain
injury patients: A randomized controlled trial. International Journal of Nursing Studies, 46(6):753–758,
2009.
[51] F. Bilotta, R. Caramia, F. P. Paoloni, R. Delfini, and G. Rosa. Safety and efficacy of intensive insulin
therapy in critical neurosurgical patients. Anesthesiology, 110(3):611—619, 2009.
[52] T. C. S. Investigators. Corticosteroid Treatment and Intensive Insulin Therapy for Septic Shock in
Adults. JAMA: The Journal of the American Medical Association, 303(4):341–348, 2013.
[53] S. P. Desai, L. L. Henry, S. D. Holmes, S. L. Hunt, C. T. Martin, S. Hebsur, and N. Ad. Strict versus
liberal target range for perioperative glucose in patients undergoing coronary artery bypass grafting:
A prospective randomized controlled trial. Journal of Thoracic and Cardiovascular Surgery, 143(2):
318–325, 2012.
[54] K. Giakoumidakis, R. Eltheni, E. Patelarou, S. Theologou, V. Patris, N. Michopanou, T. Mikropoulos,
and H. Brokalaki. Effects of intensive glycemic control on outcomes of cardiac surgery. Heart and Lung:
Journal of Acute and Critical Care, 42(2):146–151, 2013.
[55] D. Macrae, R. Grieve, E. Allen, Z. Sadique, K. Morris, J. Pappachan, R. Parslow, R. C. Tasker, and
D. Elbourne. A Randomized Trial of Hyperglycemic Control in Pediatric Intensive Care. New England
Journal of Medicine, 370(2):107–118, 2014.
[56] A. E. Johnson, T. J. Pollard, L. Shen, L. W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits,
L. Anthony Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data,
3:1–9, 2016.
[57] R. S. Anand, P. Stey, S. Jain, D. R. Biron, H. Bhatt, K. Monteiro, E. Feller, M. L. Ranney, I. N. Sarkar,
and E. S. Chen. Predicting mortality in diabetic icu patients using machine learning and severity in-
dices. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational
Science, 2017:310—319, 2018.
[58] R. Sadeghi, T. Banerjee, and W. Romine. Early hospital mortality prediction using vital signals. CoRR,
abs/1803.06589, 2018.
[59] A. E. W. Johnson and R. G. Mark. Real-time mortality prediction in the intensive care unit. AMIA ...
Annual Symposium proceedings. AMIA Symposium, 2017:994–1003, 2017.
[60] American Board of Internal Medicine. ABIM Laboratory Reference Ranges – July 2014. 6(July):3–10,
2014.
[61] S. Gorman, A. Hauber, M. Kroohs, E. Moritz, and B. Sanders. Laboratory Values Interpretation
Resource. Academy of Acute Care Physical Therapy – APTA Task Force on Lab Values, pages 1–42,
2017.
[62] P. Eaton. Clinical Biochemistry Reference Ranges. 10(2):1–20, 2014.
[63] C. H. Lee and H.-J. Yoon. Medical big data: promise and challenges. Kidney Research and Clinical
Practice, 36(1):3–11, 2017.
[64] H. Kang. The prevention and handling of the missing data. Korean Journal of Anesthesiology, 64(5):
402–406, 2013.
[65] F. Cismondi, A. S. Fialho, S. M. Vieira, S. R. Reti, J. M. Sousa, and S. N. Finkelstein. Missing data in
medical databases: Impute, delete or classify? Artificial Intelligence in Medicine, 58(1):63–72, 2013.
[66] G. M. O’Reilly, P. A. Cameron, and D. J. Jolley. Which patients have missing data? An analysis of
missingness in a trauma registry. Injury, 43(11):1917–1923, 2012.
[67] H. Motoda and H. Liu. Feature selection, extraction and construction. 2002.
[68] P. Sondhi. Feature construction methods: A survey. 2009.
[69] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-
sampling technique. Journal of Artificial Intelligence Research, 16(1):321–357, June 2002.
[70] I. Tomek. Two modifications of CNN. 1976.
[71] I. Tomek. An experiment with the edited nearest-neighbor rule. IEEE Transactions on Systems, Man,
and Cybernetics, SMC-6(6):448–452, 1976.
[72] J. Laurikkala. Improving identification of difficult small classes by balancing class distribution. In
Proceedings of the 8th Conference on AI in Medicine in Europe: Artificial Intelligence Medicine, AIME
’01, pages 63–66, 2001.
[73] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL
www.scipy.org.
[74] W. Mckinney. Python data analysis library, 2014–. URL pandas.pydata.org.
[75] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830,
2011.
[76] G. Lemaître, F. Nogueira, and C. K. Aridas. Imbalanced-learn: A python toolbox to tackle the curse of
imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017.
[77] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
[78] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly
efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30
(NIPS), pages 3146–3154, 2017.
[79] A. V. Dorogush, A. Gulin, G. Gusev, N. Kazeev, L. O. Prokhorenkova, and A. Vorobev. CatBoost:
unbiased boosting with categorical features. CoRR, abs/1706.09516, 2017.
[80] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing In Science & Engineering, 9(3):
90–95, 2007.
[81] S. Raschka. Mlxtend: Providing machine learning and data science utilities and extensions to python’s
scientific computing stack. The Journal of Open Source Software, 3(24), Apr. 2018.
[82] S. M. Lundberg and S. Lee. A unified approach to interpreting model predictions. CoRR, abs/1705.07874,
2017.
[83] M. Wilcox. Occam's razor and machine learning. URL https://www.teradata.com/Blogs/Occam's-razor-and-machine-learning#/.
[84] C. Molnar. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/, 2018.
[85] S. B. Kotsiantis. Supervised Machine Learning: A Review of Classification Techniques. Informatica, 31:
249–268, 2007.
[86] D. R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series
B (Methodological), 20(2):215–242, 1958.
[87] N. d. Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité
des voix. Cambridge Library Collection - Mathematics. 2014.
[88] F. Galton. Vox populi. Nature, 75:450–451, 1907.
[89] J. Surowiecki. The Wisdom of Crowds. Anchor, 2005.
[90] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug. 1996.
[91] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.
[92] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the
Thirteenth International Conference on International Conference on Machine Learning, ICML’96, pages
148–156, 1996.
[93] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:
1189–1232, 2000.
[94] T. K. Ho. Random decision forests. In Proceedings of the Third International Conference on Document
Analysis and Recognition (Volume 1), ICDAR '95, pages 278–282, 1995.
[95] L. Valiant. Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a
Complex World. Basic Books, Inc., 2013.
[96] M. Kearns. Thoughts on hypothesis boosting. Unpublished, 1988.
[97] R. E. Schapire. A brief introduction to boosting. In Proceedings of the 16th International Joint Confer-
ence on Artificial Intelligence - Volume 2, IJCAI’99, pages 1401–1406, 1999.
[98] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application
to boosting. Journal of Computer and Systems Science, 55(1):119–139, Aug. 1997.
[99] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.
[100] J. N. van Rijn and F. Hutter. Hyperparameter importance across datasets. In Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, pages
2367–2376, 2018.
[101] A. Swalin. Catboost vs. light gbm vs. xgboost – towards data science, Mar 2018. URL https:
//towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db.
[102] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support
vector machines. Machine Learning, 46(1):389–422, 2002.
[103] S. M. Lundberg, G. G. Erion, and S. Lee. Consistent individualized feature attribution for tree ensembles.
CoRR, abs/1802.03888, 2018.
[104] B. Hayes. Demystifying confusion matrix. URL http://benhay.es/posts/demystifying-confusion-matrix/.
[105] T. Saito and M. Rehmsmeier. The precision-recall plot is more informative than the roc plot when
evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10(3):1–21, March 2015.
[106] T. J. Pollard, A. E. W. Johnson, J. D. Raffa, L. A. Celi, and R. G. Mark. The eICU Collaborative
Research Database, a freely available multi-center database for critical care research. Scientific Data,
5:1–13, 2018.
[107] T. C. Hyatt, R. P. Phadke, G. R. Hunter, N. C. Bush, A. J. Munoz, and B. A. Gower. Insulin sensitivity
in African-American and white women: association with inflammation. Obesity (Silver Spring, Md.), 17
(2):276–282, 2009.
[108] K. Kodama, D. Tojjar, S. Yamada, K. Toda, C. J. Patel, and A. J. Butte. Ethnic differences in the
relationship between insulin sensitivity and insulin response: A systematic review and meta-analysis.
Diabetes Care, 36(6):1789–1796, 2013.
[109] S. Singh, Y. S. Prakash, A. Linneberg, and A. Agrawal. Insulin and the Lung: Connecting Asthma and
Metabolic Syndrome. Journal of Allergy, 2013:1–8, 2013.
[110] G. Sagun, C. Gedik, E. Ekiz, E. Karagoz, M. Takir, and A. Oguz. The relation between insulin resistance
and lung function: A cross sectional study. BMC Pulmonary Medicine, 15(1):1–8, 2015.
[111] G. Piazzolla, A. Castrovilli, V. Liotino, M. R. Vulpi, M. Fanelli, A. Mazzocca, M. Candigliota, E. Berardi,
O. Resta, C. Sabba, and C. Tortorella. Metabolic syndrome and Chronic Obstructive Pulmonary Disease
(COPD): The interplay among smoking, insulin resistance and vitamin D. PLoS ONE, 12(10), 2017.
[112] W. R. Farwell and E. N. Taylor. Serum bicarbonate, anion gap and insulin resistance in the national
health and nutrition examination survey. Diabetic Medicine, 25(7):798–804, 2008.
[113] A. Natekin and A. Knoll. Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7:21, 2013.
Appendix A
Outlier Detection
Figure A.1: Outlier detection for the anion gap variable
Figure A.2: Outlier detection for the bicarbonate variable
Figure A.3: Outlier detection for the chloride variable
Figure A.4: Outlier detection for the creatinine variable
Figure A.5: Outlier detection for the hemoglobin variable
Figure A.6: Outlier detection for the hematocrit variable
Figure A.7: Outlier detection for the MCH variable
Figure A.8: Outlier detection for the MCHC variable
Figure A.9: Outlier detection for the MCV variable
Figure A.10: Outlier detection for the platelet variable
Figure A.11: Outlier detection for the RBC variable
Figure A.12: Outlier detection for the RDW variable
Figure A.13: Outlier detection for the sodium variable
Figure A.14: Outlier detection for the BUN variable
Figure A.15: Outlier detection for the glucose variable
Appendix B
Gradient Boosting Machines
The mathematical formulation of gradient boosting presented here is based on the work of Natekin and Knoll [113].
Function Estimation
Given a dataset (X_i, y_i), i = 1, ..., N, where X refers to the explanatory input variables and y to the output
variable, the goal is to obtain an estimate F(X) of the functional dependence between X and y that minimizes
a specified loss function ψ(y, F):

F(X) = arg min_{F(X)} ψ(y, F(X))    (B.1)
Rewriting the estimation in terms of expectations, the equivalent formulation is to minimize the expected
loss function over the response variable, E_y(ψ[y, F(X)]), conditioned on the observed explanatory data X:

F(X) = arg min_{F(X)} E_X[ E_y(ψ[y, F(X)]) | X ]    (B.2)
To make the function estimation problem tractable, the function space can be restricted to a parametric
family of functions F(X, θ):

F(X) = F(X, θ)    (B.3)

θ = arg min_θ E_X[ E_y(ψ[y, F(X, θ)]) | X ]    (B.4)
To perform the estimation, iterative numerical procedures are considered.
Numerical optimization
Given M iteration steps, the parameter estimate can be written in incremental form:

θ = Σ_{i=1}^{M} θ_i    (B.5)
The simplest and the most frequently used parameter estimation procedure is the steepest gradient descent.
Given N data points (X_i, y_i), i = 1, ..., N, the objective is to decrease the empirical loss function J(θ) over
the observed data:

J(θ) = Σ_{i=1}^{N} ψ(y_i, F(X_i, θ))    (B.6)
The steepest descent optimization procedure is organized as in Algorithm 5.

Algorithm 5 Steepest descent optimization
1: Initialize the parameter estimate θ_0
2: for t = 1 to M do
3:    Compute the accumulated parameter estimate θ^t = Σ_{i=1}^{t-1} θ_i
4:    Evaluate the gradient of the loss function: ∇J(θ) = {∇J(θ_i)} = [∂J(θ)/∂θ_i]_{θ = θ^t}
5:    Calculate the new incremental parameter estimate θ_t ← −∇J(θ)
6:    Add the new estimate θ_t to the ensemble
7: end for
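As a toy illustration (not part of the thesis pipeline), the steepest descent loop of Algorithm 5 can be sketched in Python for a simple quadratic loss. The fixed step size rho is an added assumption here, since the algorithm above uses the raw negative gradient as the increment.

```python
import numpy as np

def steepest_descent(grad_J, theta0, M=200, rho=0.1):
    """Accumulate incremental estimates theta_t = -rho * grad J(theta^t)."""
    theta = np.asarray(theta0, dtype=float)   # initial estimate theta_0
    for _ in range(M):
        increment = -rho * grad_J(theta)      # new incremental estimate theta_t
        theta = theta + increment             # add it to the ensemble
    return theta

# Example: J(theta) = ||theta - target||^2, with gradient 2 * (theta - target).
target = np.array([1.0, 2.0])
theta_star = steepest_descent(lambda th: 2.0 * (th - target), np.zeros(2))
```

For this convex loss the accumulated estimate converges to the minimizer of J.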
Optimization in function space
The optimization in gradient boosting is performed in function space, through a parameterization of the
function estimate F in an additive functional form,

F(x) = F_M(x) = Σ_{i=0}^{M} f_i(x)    (B.7)

where f_0 is an initial guess and {f_i}, i = 1, ..., M, are incremental functions called "steps" or "boosts".

To attain a "greedy stagewise" approach of function incrementing with weak learners h(X, θ), in which
previously entered terms are not readjusted when new ones are added, an optimal step size ρ_t has to be
selected at each iteration; the optimization is thus defined as

F_t ← F_{t−1} + ρ_t h(X, θ_t)    (B.8)

(ρ_t, θ_t) = arg min_{ρ,θ} Σ_{i=1}^{N} ψ(y_i, F_{t−1}(X_i) + ρ h(X_i, θ))    (B.9)
Gradient boosting algorithm

To define a particular GBM, it is necessary to specify the loss function ψ(y, F) to be optimized and to
choose the type of weak learner h(X, θ) used to make predictions.

Each new weak learner h(X, θ_t) is chosen to be the most parallel to the negative gradient {−g_t(X_i)},
i = 1, ..., N, along the observed data, where

g_t(x) = E_y[ ∂ψ(y, F(x)) / ∂F(x) | x ]_{F(x) = F_{t−1}(x)}    (B.10)
Instead of searching for the boost increment in the full function space, the new weak learner is chosen to be
the most highly correlated with −g_t(x) over the data distribution, which reduces to the least-squares problem

(ρ_t, θ_t) = arg min_{ρ,θ} Σ_{i=1}^{N} [ −g_t(X_i) − ρ h(X_i, θ) ]²    (B.11)
The gradient boosting procedure is summarized in Algorithm 6.

Algorithm 6 Gradient boosting algorithm

Inputs:

• input data (X_i, y_i), i = 1, ..., N

• number of iterations M

• choice of the loss function ψ(y, F)

• choice of the weak learner h(X, θ)

Algorithm:
1: Initialize f_0
2: for t = 1 to M do
3:    Compute the negative gradient g_t(x)
4:    Fit a new weak learner h(X, θ_t)
5:    Find the best gradient descent step size ρ_t = arg min_ρ Σ_{i=1}^{N} ψ[y_i, f_{t−1}(X_i) + ρ h(X_i, θ_t)]
6:    Update the function estimate: f_t ← f_{t−1} + ρ_t h(X, θ_t)
7: end for
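For the squared loss the negative gradient is simply the residual, and the line search for ρ_t has a closed form. The steps of Algorithm 6 can then be sketched as below; this is an illustrative least-squares sketch with shallow scikit-learn trees as weak learners, not the LightGBM implementation used in this work.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, depth=2):
    """Least-squares gradient boosting: each weak learner h is fit to the
    negative gradient (the residuals, for squared loss), and a closed-form
    line search gives the step size rho_t."""
    f0 = float(np.mean(y))                 # initial guess f_0 (constant model)
    pred = np.full(len(y), f0)
    learners, steps = [], []
    for _ in range(M):
        neg_grad = y - pred                # -dψ/dF for ψ = (y - F)^2 / 2
        h = DecisionTreeRegressor(max_depth=depth).fit(X, neg_grad)
        hx = h.predict(X)
        rho = (neg_grad @ hx) / (hx @ hx + 1e-12)  # arg min_ρ Σ (r_i - ρ h(X_i))^2
        pred = pred + rho * hx             # f_t <- f_{t-1} + rho_t * h(X, theta_t)
        learners.append(h)
        steps.append(rho)
    return f0, learners, steps

def gb_predict(X, f0, learners, steps):
    return f0 + sum(rho * h.predict(X) for h, rho in zip(learners, steps))

# Toy usage: fit a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
f0, learners, steps = gradient_boost(X, y)
train_mse = float(np.mean((gb_predict(X, f0, learners, steps) - y) ** 2))
```

Production libraries such as XGBoost and LightGBM add regularization, shrinkage, and histogram-based tree growing on top of this basic loop.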
Appendix C
eICU-CRD Collaborative Research
Database
The Philips eICU program is a critical care telehealth program that delivers need-to-know information to
caregivers; the data used by the remote caregivers is archived for research purposes.

The eICU Collaborative Research Database [106] comprises data from patients admitted between 2014
and 2015 to critical care units of 208 hospitals across the United States.
Analogously to the MIMIC III database, the data is de-identified so as not to compromise patients'
confidentiality and safety, and patients are identified by codes: hospitalid identifies each hospital in the
database, uniquepid uniquely specifies a patient, patienthealthsystemstayid refers to each hospital stay,
and patientunitstayid to each admission into the ICU.

Figure C.1 presents an analogy between the MIMIC III database and the eICU-CRD database in terms of
how patients are identified.
Databases Analogy

                          eICU-CRD                    MIMIC III
Hospital identification   HospitalID                  – (single hospital)
Patient identification    UniquePID                   SubjectID
Hospital admission        PatientHealthSystemStayID   HadmID
ICU admission             PatientUnitStayID           IcustayID

Figure C.1: Databases analogy
C.1 Inclusion Criteria
Initially, all patients were extracted to analyze the number of hospital admissions per patient and, for each
admission, the number of ICU stays. Readmissions were discarded, i.e., for patients with multiple admissions
or with more than one ICU stay per admission, only the first admission and first ICU stay were included, to
avoid biased assessments.
From this subset, adult (≥ 16 years old) patients who received insulin during the ICU stay were selected.
Infants were discarded because they have a different metabolism and, therefore, a different glucose control
protocol in the ICU. Lastly, only patients with a length of stay equal to or greater than 24 hours remained in
the study.
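These criteria amount to successive filters over the admissions table. The sketch below uses a toy frame with hypothetical, simplified column names (the real eICU extraction joins the patient table with the insulin infusion records) purely to illustrate the filtering logic.

```python
import pandas as pd

# Hypothetical, simplified admissions table.
stays = pd.DataFrame({
    "uniquepid":         ["p1", "p1", "p2", "p3"],
    "patientunitstayid": [10, 11, 20, 30],
    "unitvisitnumber":   [1, 2, 1, 1],        # 1 = first ICU stay of the admission
    "age":               [54, 54, 15, 70],
    "los_hours":         [30.0, 12.0, 48.0, 20.0],
    "received_insulin":  [True, True, True, True],
})

cohort = (
    stays[stays["unitvisitnumber"] == 1]        # discard ICU readmissions
    .query("age >= 16 and received_insulin")    # adults under insulin therapy
    .query("los_hours >= 24")                   # length of stay >= 24 h
)
```

In this toy example only the first stay of patient p1 survives all three filters: p2 is excluded by age and p3 by length of stay.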
The number of patients extracted at each step is described in figure C.2. The cohort prior to data treatment
and modeling comprises 8379 patients.
Figure C.2: Inclusion criteria applied to extract the cohort used in this work (208 hospitals; 139 367 patients
in the database; 132 933 patients' first ICU stay during admission; 11 999 patients that received insulin during
the ICU stay with age ≥ 16; length of stay ≥ 24 h; final cohort of 8 379 patients).
C.2 Input Variables
The variables extracted to test the models were those resulting from the feature selection process (section 5.4.3).

The last measured value of the Glasgow Coma Scale (last gcs) was extracted from the Nursecharting
table, with recourse to previous work developed in [106].

Diseases of the respiratory system (respiratory sys) and infectious diseases (infectious sys) were
extracted from the diagnosis table, where they are identified as pulmonary and infectious, respectively; this
mapping was established by inspecting the diagnosisstring column and the associated ICD-9 codes.

The minimum value of platelets and the mean value of the anion gap were extracted from the lab table,
taking into account the boundaries previously defined for MIMIC III (see appendix A).

The patient's age (age) was extracted from the patient table.
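The per-stay lab aggregates above can be sketched with a pandas group-by. The frame below is a hypothetical long-format extract of the lab table; the labname strings and values are illustrative assumptions, not the exact eICU entries.

```python
import pandas as pd

# Hypothetical long-format extract of the lab table.
lab = pd.DataFrame({
    "patientunitstayid": [1, 1, 1, 2, 2],
    "labname":   ["platelets", "platelets", "anion gap", "platelets", "anion gap"],
    "labresult": [250.0, 180.0, 14.0, 90.0, 10.0],
})

# Per-stay aggregates: minimum platelet count and mean anion gap.
platelet_min = (lab[lab["labname"] == "platelets"]
                .groupby("patientunitstayid")["labresult"].min())
aniongap_mean = (lab[lab["labname"] == "anion gap"]
                 .groupby("patientunitstayid")["labresult"].mean())
```

The resulting series are indexed by ICU stay and can be joined back onto the cohort table as model features.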
C.3 Data Treatment
Data treatment is performed independently for each subset of features used for external validation, i.e.,
the subsets of 3, 5 and 7 features.

From the cohort of 8379 patients, the missing values for each variable are plotted in figure C.3. The
list-wise deletion method is used to deal with missing values. The number of patients after missing data
removal is presented in table C.1.
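List-wise deletion can be sketched in one line with pandas; the toy frame below uses the 3-feature subset with invented values (the real extraction works on the eICU tables).

```python
import numpy as np
import pandas as pd

# Toy frame for the 3-feature subset (invented values).
df = pd.DataFrame({
    "last_gcs":     [15.0, np.nan, 7.0],
    "platelet_min": [210.0, 180.0, np.nan],
    "age":          [63, 71, 58],
})

# List-wise deletion: drop every patient (row) with at least one missing value.
complete_cases = df.dropna()
```

Only the first patient has values for all three features, so only that row survives.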
Figure C.3: Missing values for each variable
Furthermore, some variables have values outside the range delimited by the max-min normalization
previously applied to the dataset from the MIMIC III database. Patients with such values are also excluded
from the final cohort, as described in table C.1.

In fact, this step would not be necessary for the LGB model: since it is a tree-based model, values above
or below the limit values would be branched into the same group as those limits. For the LR model, however,
they might induce biased results.
Lastly, the final cohort for each feature subset is also presented in table C.1, along with the associated
mortality ratio.
Table C.1: Number of patients included in each feature subset

Number of   Patients under    Patients after          Patients after values       Died/Survived
features    insulin therapy   missing data removal    outside the range removal   (mortality ratio)
3           8379              5515                    5032                        490/4542 (0.097)
5           8379              5493                    5003                        483/4520 (0.097)
7           8379              5262                    4765                        459/4306 (0.096)
Appendix D
SHAP values
SHAP values for tree-based models are computed as follows [103]. A tree is represented by a vector of six
variables, tree = {v, a, b, t, r, d}, and the number of subsets (and their size) that pass down each branch of
the tree is kept track of through two methods, EXTEND and UNWIND, following a path m.

The path, m = {d, z, o, w}, representing the unique features split so far, is constituted by the feature
index d, whether the feature is in the set S that flows through a branch (o) or not (z), and the proportion of
sets of a given cardinality that are present, w.

EXTEND is used as a tree is traversed, keeping the subsets at each node. UNWIND reverses the process by
undoing extensions when the same feature is split twice, and by undoing each extension of the path inside a
leaf, to correctly compute the features' weights in the path.

Algorithm 7 represents the SHAP value computation for tree-based models.
Algorithm 7 Tree SHAP values [103]

Inputs: instance x, tree = {v, a, b, t, r, d}

Algorithm:
1: φ ← array of len(x) zeros
2: procedure RECURSE(j, m, p_z, p_o, p_i)
3:    m ← EXTEND(m, p_z, p_o, p_i)
4:    if v_j ≠ internal then
5:       for i = 2 to len(m) do
6:          w ← sum(UNWIND(m, i).w)
7:          φ_{m_i.d} ← φ_{m_i.d} + w (m_i.o − m_i.z) v_j
8:       end for
9:    else
10:      (h, c) ← hot and cold children of j, chosen by x_{d_j} ≤ t_j
11:      i_z ← i_o ← 1
12:      k ← FINDFIRST(m.d, d_j)
13:      if k ≠ nothing then
14:         (i_z, i_o) ← (m_k.z, m_k.o)
15:         m ← UNWIND(m, k)
16:      end if
17:      RECURSE(h, m, i_z r_h / r_j, i_o, d_j)
18:      RECURSE(c, m, i_z r_c / r_j, 0, d_j)
19:   end if
100