Data mining and modeling to predict the necessity of vasopressors for sepsis patients
José Miguel Mourinho Rodrigues
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Examination Committee
Chairperson: Professor João Rogério Caldas Pinto
Supervisor: Professor João Miguel da Costa Sousa
Co-supervisor: Doctor Susana Margarida da Silva Vieira
Members of the committee: Professor Luís Manuel Fernandes Mendonça
June 2013
Acknowledgments
I would like to thank my research advisors: Professor João Sousa for his professionalism throughout the whole endeavor, Dr. Susana Vieira for her constant attention, and Professor Luís Mendonça for his constant support and encouragement. Thanks must also go to André Fialho for his insights, availability, and help regarding the database used.
A general thanks to all the people at Centro Académico Edith Stein, especially those who wrote their theses there, for letting me know I was not alone.
A particular thanks goes to Felipe Blanco, Pedro Viegas, Joana Peleja and Pedro Antunes for their friendship and for keeping me focused on the tasks at hand.
A special thanks also goes to João Campos and Paulo Araújo for their prayers and emotional support.
A final word of thanks goes to my family, for their constant support, and to whom I dedicate this work.
Resumo
Shock is a life-or-death medical condition that requires the administration of powerful drugs: vasopressors. The timely identification of these patients, so that therapy can be prepared, is an important goal.
A set of the most frequently sampled variables available in an intensive care unit was used for the grouping (clustering) of patients. A data exploration process was then started using fuzzy clustering with the fuzzy c-means algorithm, in which four clusters were obtained and the characteristics of the groups were analyzed. A relationship between the obtained clusters and the use of vasopressors was found, and these results were visualized with the help of histograms. First, a general model was obtained. Then, four models were trained and used in a multi-model approach, one for each of the identified groups of patients. For the multi-model approach, two decision criteria were used: first an a priori decision based on the distance between the cluster centers and the patients' characteristics, and then an a posteriori decision using each of the models, in which the final value used is based on the uncertainty of each model's output relative to its threshold.
The multi-model approach with a posteriori decision performed best of the two types tested, and also achieved better results than the general model.
Keywords: Multi-model, clustering, fuzzy modeling, vasopressors
Abstract
Shock is a life-threatening medical condition requiring the administration of powerful drugs - vasopressors. Early identification of these patients is a worthy goal, in order to prepare them for therapy in a timely manner.
A subset composed of the most frequently sampled and readily available variables in an intensive care
unit (ICU) was used for clustering patients. Then, a data exploration process was executed through
fuzzy clustering based on the fuzzy c-means algorithm. Four clusters were obtained and the characteristics of the groups were analyzed. A relationship between the obtained clusters and the use of vasopressors was found, and these results were visualized with the help of histograms. First, a single general model was derived. Then, four models were trained and used in a multi-model approach, one for each identified group of patients. For the multi-model approach, two decision criteria were used: 1) an a priori decision based on the distance from the cluster centers to the patient characteristics, and 2) an a posteriori decision in which every model is evaluated and the final outcome is based on the uncertainty of each model's output with respect to its threshold.
The multi-model approach with a posteriori decision performed better of the two tested schemes, and also outperformed the single general model approach.
Keywords: Multi-model, fuzzy clustering, fuzzy modeling, vasopressors
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
1 Introduction 1
1.1 Problem overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data mining in medical care . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Clustering 7
2.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Partitioning Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Fuzzy Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Fuzzy c-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Other fuzzy clustering algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Validation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Partition Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.2 Partition Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.3 Partition Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.4 Separation Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.5 Xie-Beni Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.6 Other validation measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Knowledge Discovery in Databases 13
3.1 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Fuzzy modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.3 Model layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Preprocessing of MIMIC II Database 21
4.1 MIMIC II Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Vasopressors subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5 Fluid subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5 Clustering of MIMIC II Database 27
5.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.1 Full dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.1.2 First data reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.1.3 Clustering with high frequency data . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 Cluster Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.1 Clusters obtained . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.2 Cluster centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2.3 Main features histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2.4 Demographics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.5 Clusters by pathology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6 Results 37
6.1 Single-model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2 Multi-model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.3 Best results comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7 Conclusions 39
A Data Partitioning - GK Results 41
A.1 All data All features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
A.2 All Features Last point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
A.3 5Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.4 LastRecords5Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
B Cluster Histograms 45
B.1 Physiological variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
B.2 Variables at ICU entrance/exit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.3 Output variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.4 Cluster distribution by pathology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
C Projections 53
References 58
List of Tables
4.1 Physiological variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Static variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Chosen data features with the low sample times. . . . . . . . . . . . . . . . . . . . . . . . 24
5.1 Number of patients and features mean values for each cluster. . . . . . . . . . . . . . . . 32
6.1 Single-model results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2 Multi-model results after optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.3 Single-model and multi-model results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
B.1 Mean of physiological variables by cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
List of Figures
1.1 The interrelationship between SIRS, sepsis and infection. . . . . . . . . . . . . . . . . . . 2
3.1 KDD processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Finding ROC point AUC maximum by summing areas of the trapezoid . . . . . . . . . . . 17
3.4 Multimodel scheme with decision a priori based on cluster centers. . . . . . . . . . . . . . 18
3.5 Multimodel scheme with decision a posteriori. . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 Classifier values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1 Patient records frequency plot of the first 100 entries. . . . . . . . . . . . . . . . . . . . . . 25
5.1 Validation measures for a set of 20 runs for each number of clusters between 2 and 20
(mean and variance) using the full feature set. . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Validation measures for FCM clustering of the last three records for every patient and the
five most frequently sampled features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 Validation measures for the reduced set, and most sampled variables . . . . . . . . . . . 31
5.4 Patients in cluster and vasopressor distribution. . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 Most frequently sampled features histograms. . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.6 Patient distribution per cluster by demographics . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7 Patient distribution per cluster by pathology. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
A.1 Validation measures for the GK clustering of the full data set. . . . . . . . . . . . . . . . . 41
A.2 Validation measures using the GK clustering for the last 3 records of each patient and all
features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.3 Validation measures for GK Clustering of last 3 records of each patient and the 5 feature
subset data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.4 Validation measures for GK Clustering of the 5 feature subset data and mean of the last
three records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
B.1 Heart rate distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.2 Temperature distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.3 SpO2 distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.4 Respiratory rate distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.5 GCS total distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.6 Braden score distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.7 Hematocrit distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.8 Platelets distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.9 White blood cells (WBC) distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . 47
B.10 Hemoglobin distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.11 Red blood cells (RBC) distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.12 Blood urea nitrogen (BUN) distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . 47
B.13 Creatinine distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.14 Glucose distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.15 Potassium distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.16 Chloride distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.17 Sodium distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.18 Magnesium distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.19 Non-invasive blood pressure (NBP) distribution per cluster. . . . . . . . . . . . . . . . . . 49
B.20 NBP mean distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.21 Arterial pH distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.22 Arterial base excess distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.23 Lactic Acid distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.24 Urine output distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
B.25 Age distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.26 Sex distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.27 Mortality distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.28 SOFA score distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.29 Vasopressors administration distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
B.30 Pneumonia patients distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . 52
B.31 Pancreatitis patient distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . 52
B.32 SIRS patient distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
C.1 projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Nomenclature
Abbreviations
AUC Area under curve.
BUN Blood urea nitrogen.
CE Classification Entropy.
FCM Fuzzy c-means algorithm.
GK Gustafson-Kessel algorithm.
ICU Intensive care unit.
KDD Knowledge discovery in databases.
NBP Non-invasive blood pressure.
PC Partition Coefficient.
RBC Red blood cells.
ROC Receiver operating characteristic curve.
SAPS Simplified Acute Physiology Score.
SIRS Systemic inflammatory response syndrome.
SOFA Sequential Organ Failure Assessment score.
WBC White blood cells.
XB Xie-Beni index.
Greek symbols
δi threshold of model i.
Roman symbols
C Number of clusters.
e Stopping criterion threshold.
Mi Model i.
Oi Output of model i.
S Separation index.
Sc Partition index.
Subscripts
f fuzzy.
h hard.
i, j, k, l Computational indexes.
Superscripts
c Cluster number.
m Fuzziness index.
Chapter 1
Introduction
The aim of this study is to address the prediction of the administration of vasopressors to septic shock patients in intensive care units (ICUs). In particular, our hypothesis is that a multi-model approach to the prediction of vasopressor need, based on fuzzy clustering, leads to improved performance compared to a one-model-fits-all approach.
First, a problem overview regarding the use of vasopressors in medical care is given, followed by a study of data mining techniques used in medical care, then a review of related work, and finally the specific contributions of this dissertation.
1.1 Problem overview
Shock is a life-threatening medical emergency that can be defined as "acute circulatory failure with inadequate or inappropriately distributed tissue perfusion resulting in generalized cellular hypoxia" [1]. This means that end cells are not receiving enough blood, which deprives them of needed oxygen and can in turn lead to tissue death and organ failure.
The maintenance of end-organ perfusion is critical to prevent irreversible organ injury and failure, and
this frequently requires the use of fluid resuscitation and vasopressors [2], which are medicines used to
contract blood vessels so as to increase blood pressure in critically ill patients.
Unlike most clinical conditions, for which a clinical diagnosis is made before treatment is initiated, the treatment of shock often occurs at the same time as, or even before, the diagnostic process [2]. Thus it is possible for patients who would not require the medication to have it administered anyway.
There is also the added problem that the administration procedure itself is risky: when done urgently, it can lead to infections, ultimately increasing costs.
If it can be predicted beforehand which patients are going to need vasopressors, costs will be reduced, since fewer patients will receive the treatment unnecessarily, and catheter surgery can be performed in a more timely fashion, making it less prone to medical complications.
[Figure: Venn diagram relating infection (bacteremia, fungemia, viremia, other), sepsis, and SIRS (with non-infectious causes: pancreatitis, burns, trauma, other).]
Figure 1.1: The interrelationship between SIRS, sepsis and infection.
Figure 1.1 shows the interrelationship between systemic inflammatory response syndrome (SIRS), sepsis, and infection. Essentially, there is a condition marked by a generalized inflammatory response of the body which, when caused by a blood-borne infection, is called sepsis. Depending on the source of infection, it can be called bacteremia (in the case of pneumonia, for example), fungemia, parasitemia, or viremia, among others. Nonetheless, this state of inflammatory response can also have other causes, such as burns, trauma, or the initial stage of pancreatitis [3], also a focus of this study. A common development of these conditions is shock, called septic shock when it develops from sepsis, and it is that condition that is addressed in this study for the prediction of vasopressor need.
1.2 Data mining in medical care
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [4].
Data mining, while a relatively young field, is not untried: it has been used intensively and extensively in other areas, in particular by financial institutions, for credit scoring and fraud detection; marketers, for direct marketing and cross-selling or up-selling; retailers, for market segmentation and store layout; and manufacturers, for quality control and maintenance scheduling [5].
In medical care, the adoption of data mining has been slower, owing to the limited availability of data caused by privacy and legal issues [6]. But this trend is changing: according to [5], "In health care, data mining is becoming increasingly popular, if not increasingly essential. Some driving factors are the existence of medical insurance fraud and abuse; the ever increasing volume of data generated by health care transactions, too complex to treat by traditional methods; financial pressures to increase operating efficiency while maintaining a high level of care; and the realization that data mining can generate information that is very useful for all parties involved in the industry, from health care insurers and providers to customers."
There are many applications of data mining in the field of health care. In [5], these applications are grouped into four distinct areas: evaluation of treatment effectiveness; management of health care; customer relationship management; and detection of fraud and abuse.
The limitations of health care data mining concern the availability and quality of the data. The data has to be collected and integrated before data mining is attempted. There can be missing, corrupted, inconsistent or non-standardized data due to different formats and sources. Data can also be unavailable due to ethical, legal and social issues, such as data ownership [5]. Additionally, the databases may be designed primarily for financial/billing purposes rather than medical/clinical ones, leading to lower-quality clinical data for data mining [6].
The success of health care data mining hinges on the availability of clean health care data. Possible future directions include the standardization of clinical vocabulary and the sharing of data across organizations [5].
1.3 Related work
One particular focus of the present work is the combination of multiple models for prediction.
One application of model combination is in weather forecasting, in particular for hydrological forecasts of rainfall-runoff models for water discharge prediction, such as in [7]. Initial studies have looked into combining the output of different models through various methods; the cited study uses a simple average method, a weighted average method and a neural network method. Forecast models have also been combined non-linearly using fuzzy rules, as in [8]. In [9], the authors address the question of whether multi-model combination with less skillful models can really enhance the prediction skill of ensemble forecasts, and also whether a multi-model can perform better than the best single model available, assuming that there is a 'best' model and that it can be identified.
Another use of multi-model combination is in hybrid control applications in the process industry, where models of different setups are used. One such application is in the field of fault diagnosis, as in [10]. There, a multi-model architecture is used in which a fuzzy decision-making approach isolates faults based on the analysis of the residuals: the differences between the system output and the outputs of the models identified with and without faults.
Also relevant is the machine learning field of ensemble learning, defined as the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem [11]. Ensemble learning is primarily used to improve the (classification, prediction, function approximation, etc.) performance of a model, or to reduce the likelihood of an unfortunate selection of a poor one. Random forest [12] is a representative algorithm, consisting of many decision trees that vote to select class membership.
A recent paper [13] addresses the task of combining classifiers, from the creation of ensembles to decision fusion; for further study, the book [14] on combining pattern classifiers, of which that paper is but a summary, is advised. More recently, a survey of decision fusion and feature fusion strategies for pattern classification became available in [15]. Finally, for further research, some related areas are the fields of data fusion, decision fusion, ensemble learning and clustering ensembles.
1.4 Contributions
This thesis follows the work done on the vasopressor data subset of MIMIC II, as compiled by André Fialho and indicated in [16]. In the present work the dataset is similar, but a different approach is taken. In particular, an alternative selection of features is presented, and these features are used for clustering patients, with a focus on visualizing the results.
Like [17], this work builds upon the notion of using a multi-model architecture and combining the models to create a better performing classification algorithm. Unlike that work, however, fuzzy clustering is used here to build the models, and different methodologies are presented to combine them.
In [6] it is noted that, in biomedicine, clustering is normally used for microarray data analysis rather than for general health care data analysis, since very little is known about genes, while more information is available about health conditions and disease symptoms; clustering, in particular, is often used when little or no information is available. The present study helps to show that there is a place for clustering in health care, either as an aid to modeling or used alone to gain new insights and confirm medical information.
Finally, following this work a paper was submitted and accepted for presentation at the 2013 IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE 2013) and for publication in the conference
proceedings published by IEEE.
Chapter 2
Clustering
Clustering is an unsupervised learning task that aims at decomposing a given set of objects into sub-
groups or clusters based on similarity. The goal is then to divide the dataset in such a way that objects
belonging to the same cluster are as similar as possible, whereas objects belonging to different clusters
are as dissimilar as possible [18].
Cluster analysis is primarily a tool for discovering previously hidden structure in a set of unordered
objects. In this case one assumes that a ‘true’ or natural grouping exists in the data. However, the as-
signment of objects to the classes and the description of these classes are unknown [18]. The degree of similarity is obtained through distance functions that are part of the clustering methods and that measure the dissimilarity of the presented example cases.
Traditionally, clustering techniques have been divided into two main groups: hierarchical and partitioning [19], though more groupings can be stated, such as density-based or grid-based clustering, for example.
2.1 Hierarchical Clustering
Hierarchical techniques organize data in a nested sequence of groups, which can be visualized in the
form of a dendrogram or tree. Based on a dendrogram one can decide on the number of clusters at
which the data are best represented for a given purpose [18].
These methods can be further divided into agglomerative and divisive methods, depending on whether a bottom-up sequential aggregation of data points into a tree is made, or a top-down division of the data into several subsets is performed instead.
The most used algorithms for hierarchical clustering are known as single linkage (also known as nearest neighbor) and complete linkage (furthest distance), based on the distance function used.
It is necessary to note that, in this type of clustering, once a data point is assigned to a given cluster, it remains there until the end, making the method more sensitive to outliers and initial conditions.
2.2 Partitioning Clustering
Another clustering type is partitioning clustering. Given a positive integer c, these algorithms aim at
finding the best partition of the data into c groups based on the given dissimilarity measure and they
regard the space of possible partitions into c subsets only [18].
One of the best known partitioning clustering methods is K-means. Many other methods were designed based on variations of parts of this algorithm, such as K-medoids, which uses points from the set as centers (medoids), or K-medians, which uses medians instead of means.
K-means [20] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. It aims at minimizing an objective function (Eq. 2.1), where $d_{ij}$ is the distance function (for standard K-means, the Euclidean distance) and $u_{ij}$ the partition matrix:
$$J_h(X, U_h, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}\, d_{ij}^2 \qquad (2.1)$$
K-means is also known as 'hard' c-means, in light of the fuzzy c-means clustering algorithm described below: in classical (hard) cluster analysis, each datum is assigned to exactly one cluster.
2.3 Fuzzy Clustering
If we relax the requirement $u_{ij} \in \{0, 1\}$ placed on the cluster assignment in hard partitioning approaches, so as to allow gradual memberships of data points measured as degrees in $[0, 1]$, we enable a data point to belong to more than one cluster. The concept of these membership degrees is built upon the notion of fuzzy sets as introduced in [21].
2.3.1 Fuzzy c-means
Fuzzy partitioning is carried out through an iterative optimization of the objective function (Eq. 2.2), with the update of the memberships $u_{ij}$ and the cluster centers $c_j$:
$$J_f(X, U_f, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^m\, d_{ij}^2 \qquad (2.2)$$
where $u_{ij}$ is the partition matrix, $m > 1$ is the fuzziness exponent, and $d_{ij}$ is a distance function (for standard FCM, the Euclidean distance).
The problem is then the optimization of the objective function $J_f$ (Eq. 2.2). This method was first described in [22] with m = 2, and later generalized in [23], with the update formulas in their current form:
$$u_{ij} = \frac{1}{\sum_{l=1}^{c} \left( d_{ij}^2 / d_{lj}^2 \right)^{\frac{1}{m-1}}} \qquad (2.3)$$

$$c_i = \frac{\sum_{j=1}^{n} (u_{ij})^m x_j}{\sum_{j=1}^{n} (u_{ij})^m} \qquad (2.4)$$
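As an illustration of the iteration defined by (2.2)-(2.4), the following is a minimal NumPy sketch that alternates the center and membership updates until the largest change in the partition matrix falls below a stopping threshold. The function and variable names are ours, for illustration only; this is not the actual implementation used in this work.

```python
import numpy as np

def fcm(X, c, m=1.8, tol=0.01, max_iter=200, seed=0):
    """Minimal fuzzy c-means per Eqs. (2.2)-(2.4); X is (n_points, n_features)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # columns of U sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)   # Eq. (2.4)
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)             # guard against zero distances
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)       # Eq. (2.3)
        if np.abs(U_new - U).max() < tol:   # stopping criterion e
            return centers, U_new
        U = U_new
    return centers, U
```

With the parameters adopted later in this work, a call would look like `centers, U = fcm(data, c=4, m=1.8, tol=0.01)`.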
2.3.2 Other fuzzy clustering algorithms
Like K-means, there are many possible variations, which lead to other algorithms that can take advantage of particular cases known in advance, such as the presence of given shapes (ellipsoids, lines) or noise.
One notable variation is the Gustafson-Kessel (GK) clustering algorithm, where the distance function is changed to the Mahalanobis distance in order to detect clusters of different size and orientation. This allows it to extract more information, but makes the algorithm more sensitive to initialization and computationally more demanding.
Also of note is kernel-based fuzzy clustering, where the distance function is further modified to handle non-vectorial data such as sequences, trees or graphs.
2.4 Validation Measures
Usually the number of (true) clusters in the given data is unknown in advance. However, when using partitioning methods one is usually required to specify the number of clusters c as an input parameter. Estimating the actual number of clusters is thus an important issue [18].
The name given to these criteria is not consistent in the literature: they are called validation measures, validity criteria, evaluation measures, or validity indices. As in [24], the term used in this work is validation measures.
Such measures can be used to evaluate the clustering quality quantitatively and to compare algorithms with one another. They can also be applied to compare the results obtained with a single algorithm when the parameter values are changed. In particular, they can be used to select the optimal number of clusters: applying the algorithm for several values of c, the value c* leading to the optimal decomposition according to the considered criterion is selected [18].
External clustering validation and internal clustering validation are the two main categories of clustering validation; the main difference is whether or not external information is used [25]. Internal validation measures rely only on information in the data, and are therefore applicable to situations where there is no previous knowledge, such as a true number of clusters or previously known classes. This makes them more suitable for an exploratory knowledge discovery process such as the one in this study, and less database-specific.
In the literature, a number of internal clustering validation measures for crisp clustering have been proposed, such as the Dunn index [22], the silhouette index [26] or the Davies-Bouldin index [27]. But starting with Bezdek in 1975 [23], other measures have been proposed specifically for fuzzy clustering, using information about the partition matrix and other fuzzy clustering parameters. Moreover, using a validity measure intended for crisp clustering on fuzzy clustering results would make them dependent on some kind of defuzzification scheme.
For these reasons, we focused on internal fuzzy validation measures alone, some of which are now presented:
2.4.1 Partition Coefficient
The partition coefficient measures the amount of "overlap" between clusters. It is defined by Bezdek as follows [23]:
$$PC = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^2 \qquad (2.5)$$
2.4.2 Partition Entropy
The partition entropy validation measure computes the entropy of the obtained membership degrees and must be minimized. Like the partition coefficient, it measures only the fuzziness of the cluster partition, and is defined as [23]:
$$PE = -\frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} \log u_{ij} \qquad (2.6)$$
2.4.3 Partition Index
The partition index is the ratio of the sum of compactness and separation of the clusters. It is a sum of individual cluster validity measures, normalized through division by the fuzzy cardinality of each cluster [28]:
$$S_c(c) = \sum_{i=1}^{c} \frac{\sum_{j=1}^{N} (\mu_{ij})^m \, \|x_j - v_i\|^2}{N_i \sum_{k=1}^{c} \|v_k - v_i\|^2} \qquad (2.7)$$
2.4.4 Separation Index
In contrast to the partition index Sc, the separation index uses a minimum-distance separation for partition validity [28]:
$$S(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (\mu_{ij})^2 \, \|x_j - v_i\|^2}{N \min_{i,k} \|v_k - v_i\|^2} \qquad (2.8)$$
2.4.5 Xie-Beni Index
Also designed for fuzzy clustering, and in widespread use, is the Xie-Beni index, which quantifies the ratio of the total variation within clusters to the separation of the clusters [29]:
$$XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (\mu_{ij})^m \, \|x_j - v_i\|^2}{N \min_{i,j} \|x_j - v_i\|^2} \qquad (2.9)$$
A better clustering is obtained by minimizing XB(c). However, it has been shown that the Xie-Beni index has a problem for high numbers of clusters, where its behavior becomes monotonically decreasing. This can be addressed by calculating one of the corrected Xie-Beni indices available in the literature; in practice, however, only a rather small number of clusters is usually sought, and so the uncorrected Xie-Beni index is still used, given its widespread adoption.
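For concreteness, the measures most used later can be computed from the clustering output as in the sketch below, where U is the c-by-n partition matrix and centers the c cluster prototypes returned by the FCM sketch in section 2.3.1; the Xie-Beni denominator follows Eq. (2.9) exactly as printed above.

```python
import numpy as np

def partition_coefficient(U):
    """Eq. (2.5): PC in [1/c, 1]; higher means a crisper partition."""
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U, eps=1e-12):
    """Eq. (2.6): entropy of the memberships, to be minimized."""
    return -(U * np.log(U + eps)).sum() / U.shape[1]

def xie_beni(X, centers, U, m=1.8):
    """Eq. (2.9) as printed above: within-cluster variation over N times
    the minimum squared point-to-center distance."""
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    return ((U ** m) * d2).sum() / (X.shape[0] * d2.min())
```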
2.4.6 Other validation measures
Other possible validation measures, not used in this study but common enough to at least be named, are the Fukuyama-Sugeno index [30] and Gath and Geva's fuzzy hypervolume and partition density [31]. The last two, in particular, are better suited to GK clustering, since they require the calculation of the covariance matrix, which is already part of the GK clustering algorithm.
Chapter 3
Knowledge Discovery in Databases
Knowledge discovery in databases (KDD) is the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data [32].
Figure 3.1: KDD processes. [32] [33]
According to Fayyad [32], there are essentially five steps in the KDD process: selection, preprocessing, transformation, data mining and interpretation (also called evaluation). From an engineering (systems modeling) background, these steps could be called data acquisition, data preprocessing, feature selection, modeling and interpretation without significant loss of meaning, as in [33] and shown in Figure 3.1. The overall process of finding and interpreting patterns in data involves the repeated application of these steps, now explained in more detail:
1. Data acquisition / selection - Comprises the creation or selection of a target dataset, or of particular variables on which to perform data discovery. It is an important task, as the quality of the data will impact the quality of the data discovery process.
2. Data preprocessing - This step focuses on the cleaning or preprocessing of the data. In particular, it deals with the removal of noise or outliers, noise modeling, handling missing data fields, and accounting for time sequence information.
3. Feature selection / Transformation - Includes data reduction and projection, so as to represent the data in a form more amenable to data mining. It tries, on one hand, to select a smaller set of variables that represent the data while, on the other, eliminating redundancy that can impair the data mining process.
4. Modeling / Data mining - In this step, a data mining method is selected depending on the goal of the analysis. Six common tasks are:
• Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
• Association rule learning (Dependency modeling) – Searches for relationships between vari-
ables, sometimes referred to as market basket analysis.
• Clustering – is the task of discovering groups and structures in the data that are in some way
or another ”similar”, without using known structures in the data.
• Classification – is the task of generalizing known structure to apply to new data.
• Regression – Attempts to find a function which models the data with the least error.
• Summarization – providing a more compact representation of the data set, including visualization and report generation.
5. Interpretation - In this step, the patterns obtained through data mining are evaluated to see whether they are interesting or not; thus it can also be called evaluation. The aim is to represent the results in an appropriate way so that they can be examined thoroughly. If a pattern is not interesting, the cause has to be found, and further attempts can be made, redoing some of the previous steps.
3.1 Modeling
The choice of the modeling technique to be used may depend on many factors, including the source of the data set and the values it contains. Methods based on fuzzy systems inherit model transparency and enjoy good function approximation properties. For these reasons, fuzzy modeling was used in this work, along with fuzzy clustering.
3.1.1 Fuzzy modeling
Fuzzy modeling is a tool that allows approximation of nonlinear systems when there is little or no previous
knowledge of the problem to be modeled [34].
This approach provides a transparent, non-crisp model, and also makes a linguistic interpretation possible, in the form of rules and logical connectives. These are used to establish relations between the defined features in order to derive a model.
A fuzzy classifier contains a rule base consisting of a set of fuzzy if-then rules, together with a fuzzy inference mechanism. These models ultimately classify each instance of the dataset as belonging, with a certain degree, to one of the possible classes defined for the specific problem being modeled.
As suggested in [35], a discriminant method was used in this work, where the classification is based on
the largest discriminant function. In this method, a separate discriminant function dc(x) is associated
with each class wc, with c = 1, ..., C. The discriminant functions can be implemented as fuzzy inference
systems. Here, we use Takagi-Sugeno (TS) fuzzy models [36] with which, each discriminant function
consists of rules of the type:
Rule $R_i^c$: If $x_1$ is $A_{i1}$ and ... and $x_M$ is $A_{iM}$ then $d_c(x) = f_i^c$, $\;\; i = 1, 2, \ldots, K$,

where $f_i^c$ is the consequent function for rule $R_i^c$. In these K rules, the index c indicates that the rule is associated with output class c. Note that both the antecedent parts of the rules and the consequents can differ between discriminants. The classifier assigns the class label corresponding to the maximum value of the discriminant functions, i.e.

$$\max_c d_c(x). \qquad (3.1)$$
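A small sketch of such a discriminant-based TS classifier follows. The thesis fixes only the rule structure above; the Gaussian antecedent membership functions and affine consequents used here are common TS choices and should be read as assumptions, as should all names.

```python
import numpy as np

def ts_discriminant(x, rules):
    """d_c(x) for one class: the firing-strength-weighted mean of the
    rule consequents (Gaussian antecedents and affine consequents are
    assumptions; the text only specifies the TS rule structure)."""
    w, f = np.empty(len(rules)), np.empty(len(rules))
    for i, r in enumerate(rules):
        # antecedent: product ('and') of Gaussian membership degrees
        mu = np.exp(-0.5 * ((x - r["centers"]) / r["sigmas"]) ** 2)
        w[i] = mu.prod()
        # consequent f_i^c(x), taken here as affine: a . x + b
        f[i] = r["a"] @ x + r["b"]
    return (w * f).sum() / (w.sum() + 1e-12)

def ts_classify(x, rule_bases):
    """Assign the class with the largest discriminant, Eq. (3.1)."""
    return int(np.argmax([ts_discriminant(x, rb) for rb in rule_bases]))
```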
3.1.2 Model assessment
In describing the performance of binary classifiers, the accuracy of classification cannot be considered alone [37]; both the sensitivity, or hit rate, and the specificity, or true rejection rate, must also be analyzed. In medical diagnosis and in the machine learning community, one of the methods for combining these two measures into the evaluation is the analysis of the area under the ROC curve (AUC), the ROC being the receiver operating characteristic curve, a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. ROC curves allow a visualization of the trade-off between the hit rates and false alarm rates of classifiers [38].
In this work, the measures used to assess the quality of the obtained classifier were the AUC, specificity
(3.2), sensitivity (3.3) and accuracy (3.4), which can be calculated as:
$$\text{specificity} = 1 - \text{FP rate} \qquad (3.2)$$

$$\text{sensitivity} = \text{TP rate} \qquad (3.3)$$

$$\text{accuracy} = \frac{TP + TN}{P + N} \qquad (3.4)$$

where

$$\text{FP rate} = \frac{FP}{N} \qquad (3.5)$$

$$\text{TP rate} = \frac{TP}{P} \qquad (3.6)$$

and P are positives, N negatives, TP true positives, TN true negatives, and FP false positives.
All of these are entries of a matrix widely used in pattern recognition, called the confusion matrix, which is used to represent errors in assigning classes to observed patterns: its ij-th element represents the number of samples from class i which were classified as class j.
Sensitivity and specificity are statistical measures of the performance of a binary classification test, also
known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall
rate in some fields) measures the proportion of actual positives which are correctly identified as such
(e.g. the percentage of sick people who are correctly identified as having the condition). Specificity mea-
sures the proportion of negatives which are correctly identified as such (e.g. the percentage of healthy
people who are correctly identified as not having the condition, sometimes called the true negative rate).
These two measures are closely related to the concepts of type I and type II errors. A perfect predictor
would be described as 100% sensitive (i.e. predicting all people from the sick group as sick) and 100%
specific (i.e. not predicting anyone from the healthy group as sick).
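These relations translate directly into code; a minimal sketch from confusion-matrix counts:

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix
    counts, following Eqs. (3.2)-(3.6)."""
    P, N = tp + fn, tn + fp              # actual positives / negatives
    sensitivity = tp / P                 # TP rate, Eqs. (3.3), (3.6)
    specificity = 1.0 - fp / N           # 1 - FP rate, Eqs. (3.2), (3.5)
    accuracy = (tp + tn) / (P + N)       # Eq. (3.4)
    return sensitivity, specificity, accuracy
```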
Optimization of the AUC
In [16], the optimized discriminating threshold is obtained through maximization of the AUC. The AUC for each value of the threshold is calculated through a trapezoidal approximation, by adding the areas of the triangles and square shown in Figure 3.3, where an example ROC curve is shown together with the calculation of the AUC at its maximum. A good AUC value is therefore one with a large TP rate and a small FP rate, thus maximizing sensitivity and specificity.
Another option would be to select different weights for specificity and sensitivity, as used in [17]. It was found that giving equal weight to specificity and sensitivity led to the same results.
[Figure: 2×2 confusion matrix layout — hypothesized class (Y/N) versus true class (p/n), with cells True Positives, False Negatives, False Positives, True Negatives, and column totals P and N.]
Figure 3.2: Confusion matrix.
[Figure, two panels with axes FP rate vs. TP rate: (a) ROC curve; (b) AUC approximation by the areas 1, 2 and 3.]
Figure 3.3: Finding ROC point AUC maximum by summing areas of the trapezoid: 1, 2 and 3.
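One plausible reading of the decomposition in Figure 3.3 is the standard one-point trapezoid, under which the AUC of a single operating point reduces to (sensitivity + specificity)/2; the following sketch uses it to scan candidate thresholds (all names, and the threshold grid, are illustrative):

```python
import numpy as np

def single_point_auc(fp_rate, tp_rate):
    """Area under the one-point ROC curve (0,0) -> (FPR,TPR) -> (1,1):
    triangle + rectangle + triangle, i.e. (sensitivity + specificity)/2."""
    return (0.5 * fp_rate * tp_rate                      # area left of point
            + (1.0 - fp_rate) * tp_rate                  # rectangle under point
            + 0.5 * (1.0 - fp_rate) * (1.0 - tp_rate))   # triangle to (1,1)

def best_threshold(outputs, labels, thresholds):
    """Pick the discrimination threshold maximizing the AUC above;
    'outputs' are real-valued classifier outputs, 'labels' are 0/1."""
    outputs, labels = np.asarray(outputs), np.asarray(labels)
    P, N = (labels == 1).sum(), (labels == 0).sum()
    def auc_at(t):
        pred = outputs >= t
        return single_point_auc((pred & (labels == 0)).sum() / N,
                                (pred & (labels == 1)).sum() / P)
    return max(thresholds, key=auc_at)
```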
3.1.3 Model layouts
When modeling a system, one can opt for a one-model-fits-all approach, where a single general model is built with all available training data, or for a multi-model solution. In the latter, there is the additional problem of selecting which model to use for each data point. This work proposes using the division of the data obtained from unsupervised fuzzy clustering to derive the multi-models; thus, the number of models is equal to the number of clusters.
The multi-model approach is compared to a single model, which was derived using the whole dataset. The model that maximized the AUC was chosen as the best one, in order to balance specificity and sensitivity.
A priori decision
The a priori decision scheme is based on cluster similarity. The criterion used for the choice of the model was the distance of each point to the cluster centers: the output of the model associated with the cluster the point is closest to is the one passed on, as shown in Figure 3.4, where M1, M2, ..., and Mn are the models. In this case, 4 clusters were used, and so we ended up with 4 models.
Figure 3.4: Multimodel scheme with decision a priori based on cluster centers.
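A sketch of this a priori routing, assuming models is a list of callables (one trained model per cluster) and centers the cluster prototypes; the names are ours:

```python
import numpy as np

def a_priori_predict(x, centers, models):
    """Route each sample to the model of its nearest cluster center."""
    d2 = ((centers - x) ** 2).sum(axis=1)   # squared distance to each center
    return models[int(np.argmin(d2))](x)    # only that model is evaluated
```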
A posteriori decision
In this scheme, the multi-model approach implements an a posteriori decision. Figure 3.5 shows the
proposed layout.
Figure 3.5: Multimodel scheme with decision a posteriori.
The decision is given by the model that has the largest difference between its output Oi and a threshold δi; see (3.7). This threshold is optimized for each model in order to maximize the AUC. The approach is based on the hypothesis that an output further from the threshold is more accurate, as there is less uncertainty.
Figure 3.6 shows the values taken by the output of the model prior to classification. One can see that the values are not 0 and 1, but range from -0.6 to 0.8. A threshold has to be applied to turn this real value into a binary one; the value chosen was the one that maximized the AUC.
[Figure, two panels: (a) range of classifier values, roughly -0.6 to 0.8 across the samples; (b) classifier threshold.]
Figure 3.6: Classifier values.
Later, to choose which model to use, a similar idea was applied. After optimization, the value chosen was 0.17 (Figure 3.6 b)). The criterion was based on the distance to the threshold (3.7), choosing at each point the model whose output is furthest from the threshold at which the class changes,

$$\max_i |O_i - \delta_i| \qquad (3.7)$$

where the threshold used was δi = 0.17, with Oi being the output of model i.

Another criterion tested was based on the distance from each output to the extremes of classification, [0, 1], choosing the model for which the difference between its output and the nearest extreme was smallest.
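A sketch of the a posteriori criterion (3.7), under the same assumptions as before, with deltas holding the optimized per-model thresholds (e.g. np.full(4, 0.17) for the value reported above):

```python
import numpy as np

def a_posteriori_predict(x, models, deltas):
    """Evaluate every model and keep the one whose output lies furthest
    from its own threshold, Eq. (3.7); return its binary decision."""
    outputs = np.array([m(x) for m in models])
    i = int(np.argmax(np.abs(outputs - deltas)))
    return int(outputs[i] >= deltas[i])
```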
Chapter 4
Preprocessing of MIMIC II Database
One of the essential parts of data mining is access to a database. In real life there can be missing data and other obstacles that one must tackle before fully using a dataset.
In this work we use data from a medical database known as MIMIC II, which is introduced below. An overview of the actual subset used is given, along with information regarding the preprocessing and feature selection that were required before using the data.
4.1 MIMIC II Database
The MIMIC II (Multi-parameter Intelligent Monitoring in Intensive Care) Clinical Database contains detailed information on more than 25,000 intensive care unit patients. It was initially composed of data from adult patients admitted to the ICUs of Boston's Beth Israel Deaconess Medical Center, an academic medical center with 620 beds, 77 of which are for critical care, during the period 2001-2007 [39].
It is composed of two parts. The first is the MIMIC II Waveform Database, which includes bedside monitor trends and waveforms and is freely available. The second is the MIMIC II Clinical Database, which includes all other elements of MIMIC II, such as patient demographics, physiological measurements, lists of procedures, medications, lab tests, fluid balance, notes and staff reports; it is available to qualified researchers who obtain human subjects training, under the terms of a data use agreement concerning issues of human research and privacy. This information can be queried or downloaded from the database website.1
Preprocessing was undertaken to improve data quality. Missing data was imputed consistently with the accepted last value carried forward method [40, 41, 42]. Outliers were addressed using the inter-quartile range method [43]. Normalization of the data used the min-max procedure. Finally, the data was aligned with a gridding approach based on the heart rate sampling frequency [44].

1 http://physionet.org/mimic2/
4.2 Vasopressors subset
In clinical practice, it is common to attribute to each patient a series of ICD-9 codes. This is a medical coding in which every health condition (sign, symptom or disease) is assigned a unique code, grouped by similar conditions. These codes were used to select two specific groups of patients: pancreatitis2 and pneumonia3. These are two conditions prone to the development of systemic shock that may end up requiring the use of fluids and vasopressor agents.
For the selection of patients, a set of variables usually obtained in the ICU by non-invasive means was chosen, along with the times (in hours) between samples, as indicated in Table 4.1.
 #   Variable (units)                              95% CI for time between samples (hours)
 1   Heart Rate (beats/min)                        0.85 - 0.86
 2   Temperature (C)                               2.64 - 2.68
 3   SpO2 (%)                                      0.85 - 0.87
 4   Respiratory Rate (breaths/min)                0.90 - 0.91
 5   GCS Total                                     3.30 - 3.33
 6   Braden Score                                  7.28 - 7.46
 7   Hematocrit (%)                                13.0 - 13.5
 8   Platelets (cells/L)                           16.6 - 17.0
 9   WBC - White Blood Cells (10^3/mL)             17.3 - 17.8
10   Hemoglobin (g/L)                              16.9 - 17.5
11   RBC - Red Blood Cells (10^6/mL)               17.4 - 17.9
12   BUN - Blood Urea Nitrogen (mg/dL)             14.7 - 15.2
13   Creatinine (mg/dL)                            14.7 - 15.1
14   Glucose (mg/dL)                               8.7 - 9.0
15   Potassium (mEq/L)                             10.6 - 10.9
16   Chloride (mEq/L)                              14.7 - 15.1
17   Sodium (mEq/L)                                13.8 - 14.3
18   Magnesium (mg/dL)                             14.1 - 14.6
19   NBP - Non-invasive Blood Pressure (mmHg)      1.13 - 1.16
20   NBP Mean (mmHg)                               1.14 - 1.17
21   Arterial pH                                   6.45 - 6.73
22   Arterial Base Excess (mEq/L)                  6.57 - 6.86
23   Lactic Acid (mg/dL)                           13.0 - 14.46
24   Urine Output (mL)                             1.17 - 1.19
Table 4.1: Physiological variables.
A series of variables related to the patient information obtained on ICU admission was also grouped, as indicated in Table 4.2.
2 Pancreatitis ICD-9 codes: 577.0; 577.1; 577.2; 577.8; 577.9; 579.4
3 Pneumonia ICD-9 codes: 003.22; 020.3; 020.4; 020.5; 021.2; 022.1; 031.0; 039.1; 052.1; 055.1; 073.0; 083.0; 112.4; 114.0; 114.4; 114.5; 115.05; 115.15; 115.95; 130.4; 136.3; 480.0; 480.1; 480.2; 480.3; 480.8; 480.9; 481; 482.0; 482.1; 482.2; 482.3; 482.30; 482.31; 482.32; 482.39; 482.4; 482.40; 482.41; 482.42; 482.49; 482.8; 482.81; 482.82; 482.83; 482.84; 482.89; 482.9; 483; 483.0; 483.1; 483.8; 484.1; 484.3; 484.5; 484.6; 484.7; 484.8; 485; 486; 513.0; 517.1
 #   Variable                       Remark
 1   Patient ID
 2   Age at ICU admission
 3   Sex                            0 if female, 1 if male
 4   Mortality                      1 if patient died while in the ICU
 5   Hospital time stay
 6   ICU time stay
 7   SAPS score at ICU admission
 8   SOFA score at ICU admission
 9   Vasopressor administration     1 if it was administered
Table 4.2: Static variables.
For these records there was also a binary variable recording whether, at each instant, a given patient had vasopressors being administered or not. This served as the output variable for the prediction of vasopressor need.
4.3 Preprocessing
As with any real database, a few steps had to be taken before using the data, in particular due to the presence of missing data, outliers, and the need for data synchronization.
In order to tackle differences in collection frequency, the heart rate signal was used as a template variable to align the remaining variables, since it was the most frequently measured variable. This process is presented in more detail in [44]. In particular, the values were interpolated and the points in sync with the template variable were chosen.
Regarding missing data, the chosen procedure was to impute recoverable missing segments by cubic
interpolation.
Additionally, in the de-identification step of the MIMIC II Database, patients whose age was higher than 90 had their age set to 200, as a visible outlier. We then changed these to 92, so as not to weigh too heavily on the data mining processes.
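Putting the preprocessing steps of this section and of section 4.1 together, a pandas sketch might look as follows; the column names, the IQR fence factor of 1.5, and the interpolation gap limit are our assumptions, not values from the thesis:

```python
import pandas as pd

def preprocess(df):
    """Sketch of the preprocessing steps described above."""
    df = df.copy()
    # de-identified ages (coded 200 for age > 90) mapped back to 92
    df.loc[df["age"] > 90, "age"] = 92
    phys = df.columns.drop("age")
    # inter-quartile range outlier removal: values outside the fences -> NaN
    q1, q3 = df[phys].quantile(0.25), df[phys].quantile(0.75)
    iqr = q3 - q1
    df[phys] = df[phys].where((df[phys] >= q1 - 1.5 * iqr)
                              & (df[phys] <= q3 + 1.5 * iqr))
    # recoverable gaps by cubic interpolation (needs scipy), the rest by
    # last value carried forward
    df[phys] = df[phys].interpolate(method="cubic", limit=3).ffill()
    # min-max normalization to [0, 1]
    df[phys] = (df[phys] - df[phys].min()) / (df[phys].max() - df[phys].min())
    return df
```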
4.4 Feature Selection
Normally, for the task of data mining, one focuses on a subgroup of variables rather than the complete set. This can be for various reasons: too many variables can lead to redundancy, which can lower the prediction performance; it can also be for computational purposes, as fewer variables decrease computation time, and there is often quite a lot of data in data mining applications.
To select which variables to discard in feature selection, one could apply a process to discover which variables, separately or in groups, have the highest predictive power, and so avoid removing the most important ones and in turn diminishing the prediction rate. In [16], several techniques were used to select these variables on a similar dataset for the problem of vasopressor prediction. In particular, both bottom-up and top-down tree search approaches and an ant colony optimization method were used, without significant loss of performance.
In this work, this step of selecting an optimal subset of features was not performed. We focused instead on the particular subset of the 5 most frequently sampled variables, shown in Table 4.3. This was also the variable subset used for the clustering process.
ID   Variable (units)
 1   Heart Rate (beats/min)
 3   SpO2 (%)
 4   Respiratory Rate (breaths/min)
19   NBP - Non-invasive blood pressure (mmHg)
24   Urine Output (mL)
Table 4.3: Chosen data features, with the lowest sample times.
Since the preprocessing stage required all variables to be synchronized to the heart rate [44], in the features where the sample time was largest, and which were thus less frequently sampled, some of the records were obtained through interpolation and so were more prone to errors. The chosen features, in contrast, had sample times similar to the template variable, so their values were less influenced by the interpolation process.
There were initially 1489 patients, but a significant number of patients with too few records was found, as can be seen in Figure 4.1. Those with fewer than three records were therefore removed, leaving 1220 patients, 80 of them with pancreatitis and 430 with pneumonia.
[Figure: histogram of the number of samples per patient (# of samples vs. # of occurrences).]
Figure 4.1: Patient records frequency plot of the first 100 entries.
4.5 Fluid subset
Another data set was made considering a selection of patients that were given fluids, one of the possible
treatments for shock and usually the step done before the administration of vasopressors. Thus the
problem would be the prediction of the use of vasopressor drugs on patients that were given fluids for
the treatment of shock. In this dataset there are more patients (2944 in total), less features are used and
a reduced number of vasopressors are included.
Chapter 5
Clustering of MIMIC II Database
Once the data mining method was selected, the steps toward the actual knowledge retrieval were taken, through the clustering of the MIMIC II vasopressors data subset.
This chapter is organized as follows. First, a pattern in the data giving an indication of the number of clusters the data could be partitioned into was searched for; to this end, a few validation measures were calculated over a sequence of cluster numbers, as discussed in section 2.4. Then the data was progressively reduced to a smaller selection that gave clearer results. Finally, the clustering solution was evaluated through the analysis of cluster histograms and cluster centers.
5.1 Clustering
In Chapter 2, clustering was introduced and the fuzzy c-means (FCM) clustering method was described in more detail. This was the method chosen, for the reasons indicated there: it is a standard clustering method, simple and easily extended. It makes sense to start with FCM and then choose a more advanced algorithm to tackle any specific hindrance, should obstacles arise.
Another clustering algorithm, Gustafson-Kessel (GK), was also used initially, and tests were carried out for all the mentioned validation measures. As there were no improvements compared to FCM, it was dropped in favor of the latter; these results are available in Appendix A for the interested reader. GK was also found to be computationally more expensive.
In all the tests, the FCM algorithm was used with the fuzziness parameter m = 1.8 and a minimum amount of improvement (stopping criterion) e = 0.01. In [45], a fuzziness parameter between 1.5 and 2.5 is advised. In this work, the value 2.0 was used initially but was changed to 1.8 because, in the initial steps, the cluster centers were found to be too near each other. The relative positions of the calculated centers can be seen in Appendix C.
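For concreteness, a minimal NumPy sketch of the FCM loop with the parameters above (m = 1.8, e = 0.01) follows. This is a generic textbook implementation, not the code actually used in this work, and it assumes the data matrix X (one row per sample) has already been normalized.

    import numpy as np

    def fcm(X, c, m=1.8, e=0.01, max_iter=100, seed=0):
        """Fuzzy c-means: returns cluster centers and membership matrix U."""
        rng = np.random.default_rng(seed)
        U = rng.random((c, X.shape[0]))
        U /= U.sum(axis=0)                          # memberships sum to 1
        for _ in range(max_iter):
            Um = U ** m
            centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
            d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
            d = np.fmax(d, 1e-10)                   # avoid division by zero
            U_new = d ** (-2.0 / (m - 1.0))
            U_new /= U_new.sum(axis=0)
            if np.abs(U_new - U).max() < e:         # minimum improvement reached
                return centers, U_new
            U = U_new
        return centers, U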
The FCM algorithm requires the user to indicate the number of clusters into which to partition the data. Therefore, for a series of runs, validation measures were computed over a range of numbers of clusters in order to find this number, as addressed in Section 2.4.
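As an illustration of this procedure, the sketch below computes the Xie-Beni index (compactness over separation; lower is better, following its standard definition) for each candidate number of clusters, averaging over a batch of 20 runs as done here. It reuses fcm and the normalized matrix X from the sketch above.

    def xie_beni(X, centers, U, m=1.8):
        d2 = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) ** 2
        compactness = ((U ** m) * d2).sum()
        cd2 = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2) ** 2
        np.fill_diagonal(cd2, np.inf)               # ignore self-distances
        return compactness / (X.shape[0] * cd2.min())

    for c in range(2, 11):                          # candidate cluster numbers
        xb = [xie_beni(X, *fcm(X, c, seed=s)) for s in range(20)]
        print(c, np.mean(xb), np.var(xb))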
5.1.1 Full dataset
First, using the complete dataset without any feature selection, clustering was attempted with all 24 features and all patient records. Validation measures were calculated over a batch of 20 runs for numbers of clusters ranging from 2 to 10.
Figure 5.1 shows the mean values of the validity measures partition coefficient, classification entropy, partition index, separation index, and Xie-Beni index over a batch of 20 runs for each number of clusters tested. The trends were monotonic: there was no clear minimum in the curves of the partition index, separation index, and Xie-Beni index, and no clear extremum in the partition coefficient or classification entropy curves.
Therefore, the results gave no clear indication of a number of groups in the data.
5.1.2 First data reduction
Work proceeded by reducing the dataset used for clustering, using a smaller subset of the whole data instead. For this goal, the mean of the last three records was taken for each patient with at least three recorded samples; this was one of the reasons why entries with fewer than three records were discarded initially, as stated in Section 4.4. This left one point per patient, so 1220 points were used instead of the whole dataset.
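A sketch of this reduction, continuing the pandas example of Section 4.4 and assuming a hypothetical timestamp column and the five variable names from Table 4.3:

    features = ["heart_rate", "spo2", "resp_rate", "nbp", "urine_output"]
    last3 = records.sort_values("timestamp").groupby("patient_id").tail(3)
    points = last3.groupby("patient_id")[features].mean().to_numpy()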
The results are shown in Figure 5.2, where it can be seen that the values were relatively high for low numbers of clusters, especially in Figure 5.2 c), though Figures 5.2 a) and b) could be enough to identify 5 as a possible candidate for the number of clusters.
From this point on, the validation measures partition coefficient and classification entropy were dropped, as they showed no notable differences in the plots, either before or after data reduction.
5.1.3 Clustering with high frequency data
Nonetheless, work proceeded by further reducing the dataset, this time by reducing the number of features used. In particular, they were reduced to those with the smallest sample times, as described in Table 4.3 of Section 4.4, each sampled approximately once every hour.
Figure 5.1: Validation measures (mean and variance) for a set of 20 runs for each number of clusters between 2 and 10, using the full feature set: (a) partition coefficient (PC), (b) classification entropy (CE), (c) partition index (Sc), (d) separation index (S), (e) Xie-Beni index (XB).
The explanation is the one given in Section 4.4: since the preprocessing stage required all variables to be synchronized to the heart rate [44], features with larger sample times had some of their records obtained through interpolation in the alignment process, and were therefore more prone to errors.
Figure 5.2: Validation measures for FCM clustering of the last three records of every patient and the five most frequently sampled features: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).

Again, the validity measures partition index, separation index, and Xie-Beni index were obtained, with the means and variances shown in Figure 5.3. This time there was a clear minimum in Figures 5.3 a) and b), though not quite so clear in c).
Figure 5.3: Validation measures (mean and variance) for the reduced set and the most frequently sampled variables: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
5.2 Cluster Evaluation
Now that a number of clusters was obtained, the next step was to find out which information was stored in these four clusters. For this goal, each point was assigned to the cluster where its fuzzy membership was highest.
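This hardening step is a simple argmax over the membership matrix U returned by the FCM sketch of Section 5.1; vaso is a hypothetical 0/1 array marking patients who received vasopressors.

    labels = U.argmax(axis=0)                       # hard cluster assignment
    for c in range(U.shape[0]):
        members = labels == c
        print("cluster", c + 1, ":", members.sum(), "patients,",
              round(vaso[members].mean(), 2), "fraction on vasopressors")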
5.2.1 Clusters obtained
In order to evaluate the clusters obtained, a more careful study was made. Figure 5.4 a) shows a histogram of the patients divided by cluster, where it can be seen that the patients are well divided among the clusters. Figure 5.4 b) shows that, despite the seemingly even distribution of patients, a careful look at which of them were administered vasopressors (our output variable) reveals that a higher number of patients with vasopressors is located in the fourth cluster and a remarkably lower number in the first, with average numbers in the second and third clusters.
Figure 5.4: Patients per cluster and vasopressor distribution: (a) patients in each cluster; (b) patients with and without vasopressors in each cluster.
5.2.2 Cluster centers
The centers of the clusters were calculated, as shown in Table 5.1. The static variables were analyzed first, i.e., those acquired at ICU entrance or exit that remained unchanged.
cluster  patients  age    sex{0,1}  mortality{0,1}  SOFA  vasopres.{0,1}
1        332       61.78  0.63      0.37            2.30  0.15
2        278       57.97  0.48      0.45            2.44  0.28
3        298       66.06  0.54      0.50            2.41  0.39
4        312       61.84  0.53      0.47            2.62  0.53
Table 5.1: Number of patients and features mean values for each cluster.
Again it can be seen that the clusters are evenly distributed in terms of patients, ages, and sexes. Cluster 3 had a slightly higher mortality, which could be explained by its slightly higher mean age and by the considerable proportion of patients administered vasopressors (about 40%). Cluster 4, the highest in terms of vasopressor administration, had the second highest mortality, followed by cluster 2 and finally cluster 1, with the lowest proportion of patients with vasopressors. This ordering is consistent with shock being a leading cause of death in an ICU.
5.2.3 Main features histograms
Turning instead to the relationship between cluster assignment and feature values in Figure 5.5, more insightful information can be retrieved. The mean of the last three values of each feature of each patient was obtained, and histograms describing the feature distribution by cluster were drawn.
In terms of heart rate, in Figure 5.5 a), there is a significant difference between clusters 2 and 3 and only a slight difference between clusters 1 and 4. It can be seen that patients with mean-to-high heart rate are assigned to cluster 2, while those with mean-to-low heart rate are assigned to cluster 3.
For the SpO2% feature, the largest difference is between clusters 2 and 4: the former with mean-low values, the latter with mean-high SpO2%, while clusters 1 and 3 show mean values.
Turning to the respiratory rate (breaths/min) feature, it can be seen that while patients with mean-high respiratory rates are assigned to clusters 1 and 2, cluster 4 is assigned patients with mean-low rates.
As for non-invasive blood pressure (NBP), cluster 1 has patients with mean-high NBP, while clusters 3 and 4 are assigned patients with mean-low NBP.
Finally, in Figure 5.5 e), we can see that low values of urine output are more common in the clusters with higher proportions of patients administered vasopressors, clusters 3 and 4, while in cluster 1, where there is a lower proportion of patients with vasopressors, the mean is around 0.5 with equal numbers of patients above and below.
The main findings from Figure 5.5 are:
i. Patients needing vasopressors are assigned, in decreasing order, to clusters 4, 3, and 2, whose common characteristic is a low urine output. This variable alone would thus be enough to indicate risk.
ii. Since cluster 4 has the highest number of patients with vasopressors, one can say that patients at risk of needing vasopressors have low urine output and low non-invasive blood pressure, along with a low respiratory rate and high SpO2%, independent of the heart rate.
iii. Cluster 1 is assigned most of the patients without need of vasopressors. Its characteristics are patients with high blood pressure and high respiratory rate, independent of heart rate or urine output. In contrast to the other clusters, a high urine output or NBP alone would be sufficient to assign a low probability of needing vasopressors.
iv. Cluster 3 patients exhibit a low heart rate but are nonetheless at high risk, so this variable alone is not sufficient to indicate shock; one should also check for low NBP or low urine output.
5.2.4 Demographics
In Figure 5.6 a), one can see that, in terms of demographics, age is not an indication of risk. Note, however, that this does not mean that the success of the treatment is independent of age, which is a different matter.
5.2.5 Clusters by pathology
One can see that, as expected, there is an increase in the incidence of SIRS in the clusters with more vasopressor patients, given that vasopressor administration is a common treatment in this condition. Less expected are the results of Figures 5.7 b) and c), where the incidence of both pneumonia and pancreatitis is roughly the same among clusters, even though these conditions are known to be able to evolve to SIRS, or to sepsis in the case of pneumonia.
Figure 5.5: Histograms of the most frequently sampled features per cluster: (a) heart rate, (b) SpO2%, (c) respiratory rate (breaths/min), (d) non-invasive blood pressure, (e) urine output.
Figure 5.6: Patient distribution per cluster by demographics: (a) age; (b) sex (1 - male, 0 - female).
Figure 5.7: Patient distribution per cluster by pathology: (a) SIRS, (b) pneumonia, (c) pancreatitis.
Chapter 6
Results
Two approaches were taken. First, a one-fits-all approach, in which a single model was built for all data points; then, a combination of models built from the clusters obtained in the previous chapter (Chapter 5). This multi-model approach was compared with the single model, and the results are discussed below.
First the general single-model results are presented, followed by the multi-model results; finally, the best results are compared.
6.1 Single-model Results
For the one-fits-all approach, the following results were obtained after 20 runs, using 60% of the data for training and 40% for testing (Table 6.1). This number of runs was chosen to balance computational time and model consistency.
             five features   24 features    p-values
AUC          0.71 ± 0.01     0.80 ± 0.01    < 0.05
specificity  0.71 ± 0.02     0.81 ± 0.03    < 0.05
sensitivity  0.72 ± 0.03     0.80 ± 0.03    < 0.05
accuracy     0.72 ± 0.03     0.80 ± 0.02    < 0.05
Table 6.1: Single-model results.
Table 6.1 shows that the model built upon only five features has significantly lower performance (p < 0.05) than the one using the whole 24-feature set. The five features were initially chosen as those most readily available in the ICU, as they had a higher sample rate. No feature selection step was made, so the lower performance may mean that information was lost when some important variables were left out. In [16], a feature selection step was performed on a similar dataset, and the variables white blood cells (WBC) and sodium were found to play an important role in predicting the use of vasopressors.
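For reference, a sketch of this evaluation protocol is shown below, with a logistic regression standing in for the Takagi-Sugeno fuzzy models actually used; the metric formulas follow the standard confusion-matrix definitions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression   # stand-in classifier
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    def evaluate(X, y, runs=20, threshold=0.5):
        """Repeated 60/40 splits; mean and std of AUC, specificity,
        sensitivity and accuracy over all runs."""
        scores = []
        for s in range(runs):
            Xtr, Xte, ytr, yte = train_test_split(
                X, y, train_size=0.6, random_state=s, stratify=y)
            model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
            p = model.predict_proba(Xte)[:, 1]
            tn, fp, fn, tp = confusion_matrix(yte, p >= threshold).ravel()
            scores.append([roc_auc_score(yte, p),
                           tn / (tn + fp),          # specificity
                           tp / (tp + fn),          # sensitivity
                           (tp + tn) / len(yte)])   # accuracy
        return np.mean(scores, axis=0), np.std(scores, axis=0)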
6.2 Multi-model Results
Table 6.2 shows the results of both approaches taken: an a priori decision based on the cluster centers, and an a posteriori decision based on the model outputs. Results show the mean and standard deviation over 20 runs, using 60% of the data for training and 40% for testing. The a posteriori approach returned significantly better classification performance, shown by the significantly better accuracy as well as significantly better specificity and sensitivity (p < 0.05). Thus it returned, on one hand, a better proportion of positives correctly classified as such, and on the other, a lower false alarm rate, as more negatives were also correctly classified as such.
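A sketch of the two decision criteria follows, under one plausible reading of the a posteriori rule (keep the per-cluster model whose output lies farthest from the classification threshold, i.e., the least uncertain response). Here models is a hypothetical list of per-cluster classifiers, centers are the cluster centers, and the 0.5 threshold is illustrative.

    import numpy as np

    def a_priori_predict(x, centers, models, threshold=0.5):
        # pick the model of the cluster whose center is nearest to the patient
        k = np.argmin(np.linalg.norm(centers - x, axis=1))
        return models[k].predict_proba(x[None, :])[0, 1] >= threshold

    def a_posteriori_predict(x, models, threshold=0.5):
        # run every model and keep the least uncertain output, i.e. the one
        # farthest from the decision threshold
        outs = np.array([m.predict_proba(x[None, :])[0, 1] for m in models])
        k = np.argmax(np.abs(outs - threshold))
        return outs[k] >= threshold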
             a priori        a posteriori   p-values
AUC          0.81 ± 0.01     0.85 ± 0.00    < 0.05
specificity  0.79 ± 0.06     0.85 ± 0.01    < 0.05
sensitivity  0.82 ± 0.06     0.84 ± 0.01    < 0.05
accuracy     0.80 ± 0.03     0.85 ± 0.01    < 0.05
Table 6.2: Multi-model results after optimization.
In addition, while using the reduced feature set to build a general model returned worse results, clustering on the same information led to a multi-model scheme with comparatively good results.
Lower results were expected in the case of the a priori model, as its models were based on the clustering of a reduced dataset, suggesting a possible loss of information that would lower the prediction rate of the models used. The overall result was nonetheless significantly better than that of the single generic model obtained from the five-feature dataset in Table 6.1.
6.3 Best results comparison
Table 6.3 compiles the single-model and the best multi-model results. The best multi-model scheme returned significantly better performance (p < 0.05) than the single-model case.
             single-model    multi-model    p-values
AUC          0.80 ± 0.01     0.85 ± 0.00    < 0.05
specificity  0.81 ± 0.03     0.85 ± 0.01    < 0.05
sensitivity  0.80 ± 0.03     0.84 ± 0.01    < 0.05
accuracy     0.80 ± 0.02     0.85 ± 0.01    < 0.05
Table 6.3: Single-model and multi-model results.
Chapter 7
Conclusions
This work proposed two decision approaches for a multi-model layout based on fuzzy clustering. The two decision criteria proposed are: 1) an a priori decision based on the distance from the cluster centers to the patient characteristics, and 2) an a posteriori decision in which each model is used and the final outcome is based on the uncertainty of each model's output response to the threshold.

Fuzzy clustering proved to be a suitable tool for finding patterns in the patient data, as a relationship between the obtained clusters and the output variable (administration of vasopressors) was found. The information gathered was insightful in light of the medical knowledge, and conclusions could be drawn from the analysis of the patient characteristics in each identified cluster. This information was used to obtain patient-specific models.

Significantly better results were obtained using a multi-model scheme than a general one to predict whether a patient needs vasopressors. This suggests that patient-specific models are more appropriate than a single general model.

In terms of the decision criteria used (a priori versus a posteriori), the results were significantly better for the a posteriori decision rule. Nonetheless, more tests with additional datasets are recommended before establishing one approach as better than the other.

As future work, the multi-model approach can be applied to other applications beyond clinical data, and the variables used could be selected through a feature selection process to avoid possible information loss and to improve performance.

Regarding further extensions to this work, the new fluids vasopressor dataset could also be used and compared, and another clustering attempt with more clusters could be performed in the hope of finding a cluster with a more significant proportion of patients on vasopressors.
Appendix A
Data Partitioning - GK Results
In the course of the data partitioning study, the Gustafson-Kessel (GK) clustering algorithm was also used. After it yielded no better results than the FCM algorithm, it was dropped in favor of the latter. The results are nonetheless presented below for reference.
As in the FCM case, all results were obtained from a batch of 20 runs, taking the mean values, for numbers of clusters ranging from 2 to 20.
A.1 All data, all features
For all the data points and features, the same group of validity measures (partition index, separation index, and Xie-Beni index) was calculated for GK clustering, as follows (Figure A.1).
Figure A.1: Validation measures for GK clustering of the full dataset: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
A.2 All features, last point
A second case was studied, using only the last time sample of each patient. The results are shown in Figure A.2.
Figure A.2: Validation measures using GK clustering for the last three records of each patient and all features: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
A.3 Five features
Later, the clustering process was applied to the complete data (including all patient samples) but only for the five most frequently sampled features, as done in the FCM case, with the following results in terms of validity measures per number of clusters (Figure A.3).
Figure A.3: Validation measures for GK clustering of the last three records of each patient and the five-feature subset: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
A.4 Last records, five features
Finally, a combination of the last two setups was tried: the last time samples of each patient and only the five most frequently sampled features. Figure A.4 shows the results of the validity measures for this test. In this case, only numbers of clusters ranging from 2 to 10 were tested, instead of up to 20 as in the other sections.
Figure A.4: Validation measures for GK clustering of the five-feature subset and the mean of the last three records: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
Appendix B
Cluster Histograms
The mean of the last three values of each feature of each patient was obtained, and histograms describing the feature distribution by cluster were drawn.
B.1 Physiological variables
First, the mean values of each feature per cluster were gathered in Table B.1, followed by the histograms for all 24 variables (Figures B.1 to B.24).
#    cluster 1  cluster 2  cluster 3  cluster 4
1    0.44       0.72       0.30       0.58
2    0.75       0.73       0.76       0.70
3    0.46       0.41       0.47       0.66
4    0.60       0.63       0.54       0.37
5    0.60       0.55       0.51       0.48
6    0.33       0.32       0.39       0.40
7    0.34       0.31       0.37       0.35
8    0.53       0.61       0.48       0.53
9    0.59       0.53       0.52       0.53
10   0.50       0.55       0.44       0.48
11   0.54       0.59       0.55       0.45
12   0.54       0.50       0.52       0.52
13   0.52       0.50       0.50       0.50
14   0.53       0.47       0.45       0.46
15   0.50       0.52       0.52       0.55
16   0.50       0.49       0.50       0.49
17   0.51       0.50       0.51       0.51
18   0.52       0.47       0.49       0.49
19   0.67       0.49       0.43       0.37
20   0.62       0.50       0.42       0.39
21   0.59       0.54       0.53       0.46
22   0.57       0.51       0.48       0.43
23   0.27       0.33       0.35       0.39
24   0.49       0.30       0.18       0.19

Table B.1: Mean of physiological variables by cluster.
(Figures B.1 to B.24 each show four histograms, one per cluster; x-axis: normalized feature value, y-axis: frequency.)

Figure B.1: Heart rate distribution per cluster.
Figure B.2: Temperature distribution per cluster.
Figure B.3: SpO2 distribution per cluster.
Figure B.4: Respiratory rate distribution per cluster.
Figure B.5: GCS total distribution per cluster.
Figure B.6: Braden score distribution per cluster.
Figure B.7: Hematocrit distribution per cluster.
Figure B.8: Platelets distribution per cluster.
Figure B.9: White blood cells (WBC) distribution per cluster.
Figure B.10: Hemoglobin distribution per cluster.
Figure B.11: Red blood cells (RBC) distribution per cluster.
Figure B.12: Blood urea nitrogen (BUN) distribution per cluster.
Figure B.13: Creatinine distribution per cluster.
Figure B.14: Glucose distribution per cluster.
Figure B.15: Potassium distribution per cluster.
Figure B.16: Chloride distribution per cluster.
Figure B.17: Sodium distribution per cluster.
Figure B.18: Magnesium distribution per cluster.
Figure B.19: Non-invasive blood pressure (NBP) distribution per cluster.
Figure B.20: NBP mean distribution per cluster.
Figure B.21: Arterial pH distribution per cluster.
Figure B.22: Arterial base excess distribution per cluster.
Figure B.23: Lactic acid distribution per cluster.
Figure B.24: Urine output distribution per cluster.
B.2 Variables at ICU entrance/exit
The same was then applied to the variables obtained at ICU entrance (age, sex, SOFA score) and at ICU exit, such as mortality (Figures B.25 to B.28).
Figure B.25: Age distribution per cluster.
Figure B.26: Sex distribution per cluster (1 - male, 0 - female).
Figure B.27: Mortality distribution per cluster (1 if the patient died during the ICU stay).
Figure B.28: SOFA score distribution per cluster.
B.3 Output variable
Also shown is the distribution of patients with and without vasopressors administered, the output variable of our study (Figure B.29).
Figure B.29: Vasopressor administration distribution per cluster (1 if vasopressors were administered).
B.4 Cluster distribution by pathology
Finally, the histograms pertaining to some of the clinical conditions prone to the development of distributive shock: pneumonia, pancreatitis, and systemic inflammatory response syndrome (SIRS) (Figures B.30 to B.32).
Figure B.30: Pneumonia patient distribution per cluster (1 - with pneumonia, 0 - without).
Figure B.31: Pancreatitis patient distribution per cluster (1 - with, 0 - without).
Figure B.32: SIRS patient distribution per cluster (1 - with, 0 - without).
Appendix C
Projections
Figure C.1 shows the projection of each variable of the five-feature reduced set, plotted pairwise, with the cluster centers represented by black circles. The negatives (patients without vasopressors) are shown in dark grey, while lighter grey represents the positives (patients with vasopressors administered).
Figure C.1: Pairwise projections of the five-feature set, with cluster centers marked.
References
[1] C. A. Graham and T. R. J. Parke. Critical care in the emergency department: shock and circulatory
support. Emergency Medicine Journal, 22(1):17–21, 2005.
[2] Stefan Herget-Rosenthal, Fuat Saner, and Lakhmir S Chawla. Approach to hemodynamic shock
and vasopressors. Clinical Journal of the American Society of Nephrology, 3(2):546–53, 2008.
[3] R. C. Bone, R. A. Balk, R. A. Cerra, R. P. Dellinger, A. M. Fein, W. A. Knaus, R. M. Schein, and W. J. Sibbald. Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. Chest, 101(6):1644–1655, 1992.
[4] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. Adaptive Computation and Machine Learning series. MIT Press, 2001.
[5] H. C. Koh and G. Tan. Data mining applications in healthcare. Journal of healthcare information
management, 19(2):64–72, 2005.
[6] Illhoi Yoo, Patricia Alafaireet, Miroslav Marinov, Keila Pena-Hernandez, Rajitha Gopidi, Jia-Fu
Chang, and Lei Hua. Data mining in healthcare and biomedicine: a survey of the literature. J
Med Syst, 36(4):2431–48, 2012.
[7] Asaad Y. Shamseldin, Kieran M. O’Connor, and G.C. Liang. Methods for combining the outputs of
different rainfall–runoff models. Journal of Hydrology, 197(1–4):203 – 229, 1997.
[8] Lihua Xiong, Asaad Y. Shamseldin, and Kieran M. O'Connor. A non-linear combination of the forecasts of rainfall-runoff models by the first-order Takagi-Sugeno fuzzy system. Journal of Hydrology, 245(1–4):196–217, 2001.
[9] A. P. Weigel, M. A. Liniger, and C. Appenzeller. Can multi-model combination really enhance the
prediction skill of probabilistic ensemble forecasts? Quarterly Journal of the Royal Meteorological
Society, 134(630):241–260, 2008.
[10] L. F. Mendonca, J. M. C. Sousa, and J. M. G. Sa da Costa. An architecture for fault detection and
isolation based on fuzzy methods. Expert Systems with Applications, 36(2):1092–1104, 2009.
[11] R. Polikar. Ensemble learning. Scholarpedia, 4(1):2776, 2009.
[12] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[13] Moacir P Ponti Jr. Combining classifiers: From the creation of ensembles to the decision fusion. In
2011 24th SIBGRAPI Conference on Graphics, Patterns, and Images Tutorials, pages 1–10. IEEE,
2011.
[14] L.I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.
[15] Pinaki Chowdhury, Sukhendu Das, Suranjana Samanta, and Utthara Mangai. A survey of decision fusion and feature fusion strategies for pattern classification. IETE Technical Review, 27(4):293–307, 2010.
[16] A. S. Fialho, F. Cismondi, S. M. Vieira, J. M. C. Sousa, S. R. Reti, L. A. Celi, M. D. Howell, and
S. N. Finkelstein. Fuzzy modeling to predict administration of vasopressors in intensive care unit
patients. In 2011 IEEE International Conference on Fuzzy Systems (FUZZ), pages 2296–2303,
june 2011.
[17] Federico Cismondi, Abigail L. Horn, Andre S. Fialho, Susana M. Vieira, Shane R. Reti, Joao
M. C. Sousa, and Stan Finkelstein. Multi-stage modeling using fuzzy multi-criteria feature selec-
tion to improve survival prediction of ICU septic shock patients. Expert Systems with Applications,
39(16):12332–12339, 2012.
[18] Rudolf Kruse, Christian Doring, and Marie-Jeanne Lesot. Fundamentals of fuzzy clustering, 2007.
[19] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering algorithms and validity measures. In Pro-
ceedings of the 13th International Conference on Scientific and Statistical Database Management,
SSDBM ’01, page 3, Washington, DC, USA, 2001. IEEE Computer Society.
[20] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–298, 1967.
[21] L. A. Zadeh. Fuzzy sets. Information and Control, 8(3):338–353, 1965.
[22] J. C. Dunn. A fuzzy relative of the isodata process and its use in detecting compact well-separated
clusters. Journal of Cybernetics, 3(3):32–57, 1973.
[23] James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic
Publishers, Norwell, MA, USA, 1981.
[24] J. M. C. Sousa and U. Kaymak. Fuzzy decision making in modeling and control. World Scientific
series in robotics and intelligent systems. World Scientific, 2003.
[25] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. Understanding of internal
clustering validation measures. In Proceedings of the 2010 IEEE International Conference on Data
Mining, ICDM ’10, pages 911–916, Washington, DC, USA, 2010. IEEE Computer Society.
[26] L. Kaufman and P.J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Wiley
series in probability and mathematical statistics: Applied probability and statistics. Wiley, 1990.
[27] David L. Davies and Donald W. Bouldin. A cluster separation measure. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, PAMI-1(2):224–227, 1979.
[28] A. M. Bensaid, L. O. Hall, J. C. Bezdek, L. P. Clarke, M. L. Silbiger, J. A. Arrington, and R. F. Murtagh.
Validity-guided (re)clustering with applications to image segmentation. IEEE Transactions on Fuzzy
Systems, 4(2):112–123, may 1996.
[29] X. L. Xie and G. Beni. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 13(8):841–847, aug 1991.
[30] Y. Fukuyama and M. Sugeno. A new method of choosing the number of clusters for fuzzy c-means
method. 1989.
[31] I. Gath and A.B. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(7):773–780, 1989.
[32] Usama Fayyad, Gregory Piatetsky-shapiro, and Padhraic Smyth. From data mining to knowledge
discovery in databases. AI Magazine, 17:37–54, 1996.
[33] Susana Vieira. Soft Computing Techniques Applied to Feature Selection. PhD thesis, Universidade Tecnica de Lisboa, 2010.
[34] M. Sugeno and T. Yasukawa. A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems, 1(1):7–31, Feb. 1993.
[35] Susana M. Vieira, Joao M. C. Sousa, and Uzay Kaymak. Fuzzy criteria for feature selection. Fuzzy
Sets and Systems, 189(1):1–18, 2012.
[36] T. Takagi and M. Sugeno. Fuzzy identification of systems and its application to modeling and
control. IEEE Trans. System Man and Cybernetics, (15):116–132, 1985.
[37] J. A. Swets. Measuring the accuracy of diagnostic systems. Science, 240(4857):1285–1293, 1988.
[38] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
[39] Mohammed Saeed, Mauricio Villarroel, Andrew T. Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H. Kyaw, Benjamin Moody, and Roger G. Mark. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): A public-access intensive care unit database. Critical Care Medicine, 39:952–960, May 2011.
[40] G. D. Clifford, W. J. Long, G. B. Moody, and P. Szolovits. Robust parameter extraction for decision
support using multimodal intensive care data. Philosophical Transactions of the Royal Society A:
Mathematical, Physical and Engineering Sciences, 367(1887):411–429, 2009.
[41] P. D. Allison. Missing Data. Number 136 in Quantitative Applications in the Social Sciences. SAGE
Publications, 2001.
[42] A. S. Fialho, F. Cismondi, S. M. Vieira, J. M. C. Sousa, S. R. Reti, R. Welsch, M. D. Howell, and S. N. Finkelstein. Missing data in large intensive care units databases. Critical Care Medicine, 38(12):U6–U6, 2010.
[43] D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Understanding robust and exploratory data analysis.
Wiley Classics Library Editions. Wiley, 2000.
[44] F. Cismondi, A. S. Fialho, S. M. Vieira, J. M. C. Sousa, S. R. Reti, M. D. Howell, and S. N. Finkel-
stein. Computational intelligence methods for processing misaligned, unevenly sampled time series
containing missing data. In 2011 IEEE Symposium on Computational Intelligence and Data Mining
(CIDM), pages 224–231, april 2011.
[45] N. R. Pal and J. C. Bezdek. On cluster validity for the fuzzy c-means model. IEEE Transactions on
Fuzzy Systems, 3(3):370–379, aug 1995.