Data mining and modeling to predict the necessity of vasopressors for sepsis patients
José Miguel Mourinho Rodrigues
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Examination Committee
Chairperson: Professor João Rogério Caldas Pinto
Supervisor: Professor João Miguel da Costa Sousa
Co-supervisor: Doctor Susana Margarida da Silva Vieira
Members of the committee: Professor Luís Manuel Fernandes Mendonça
June 2013
Acknowledgments
I would like to thank my research advisors: Professor João Sousa for his professionalism throughout the whole endeavor, Dr. Susana Vieira for her constant attention, and Professor Luís Mendonça for his constant support and encouragement. Thanks must also go to André Fialho for his insights, availability, and help regarding the database used.
A general thanks to all the people at Centro Académico Edith Stein, especially those who wrote their theses there, for letting me know I was not alone.
A particular thanks goes to Felipe Blanco, Pedro Viegas, Joana Peleja and Pedro Antunes for their friendship and for keeping me focused on the tasks at hand.
A special thanks also goes to João Campos and Paulo Araújo for their prayers and emotional support.
A final word of thanks goes to my family, for their constant support, and to whom I dedicate this work.
Resumo
Shock is a life-or-death medical condition that requires the administration of powerful drugs: vasopressors. The timely identification of these patients, so that therapy can be prepared, is an important goal.
A set of the most frequently sampled variables available in an intensive care unit was used for the grouping (clustering) of patients. A data exploration process was then started using fuzzy clustering with the fuzzy c-means algorithm, in which four clusters were obtained and the characteristics of the groups were analyzed. A relationship between the obtained clusters and the use of vasopressors was found, and these results were visualized with the help of histograms. First, a general model was obtained. Then, four models were trained and used in a multi-model approach, one for each of the identified groups of patients. For the multi-model approach, two decision criteria were used: first an a priori decision based on the distance between the cluster centers and the patients' characteristics, and then an a posteriori decision using each of the models, in which the final value used is based on the uncertainty of each model's output relative to its threshold.
The multi-model approach with a posteriori decision performed best of the two types tested, and also achieved better results than the general model.
Keywords: Multi-model, clustering, fuzzy modeling, vasopressors
Abstract
Shock is a life-threatening medical condition requiring the administration of powerful drugs - vasopressors. Early identification of these patients is a worthy goal, in order to prepare them for therapy in a timely manner.
A subset composed of the most frequently sampled and readily available variables in an intensive care
unit (ICU) was used for clustering patients. Then, a data exploration process was executed through
fuzzy clustering based on the fuzzy c-means algorithm. Four clusters were obtained and the characteristics of the groups were analyzed. A relationship between the obtained clusters and the use of vasopressors was found, and these results were visualized with the help of histograms. First, a single general model was derived. Then, four models were trained and used in a multi-model approach, one for each identified group of patients. For the multi-model approach, two decision criteria were used: 1) an a priori decision based on the distance from the cluster centers to the patient characteristics, and 2) an a posteriori decision in which every model is evaluated and the final outcome is based on the uncertainty of each model's output with respect to its threshold.
The multi-model approach with a posteriori decision performed better of the two tested schemes, and also outperformed the single general model approach.
Keywords: Multi-model, fuzzy clustering, fuzzy modeling, vasopressors
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
1 Introduction 1
1.1 Problem overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data mining in medical care . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Clustering 7
2.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Partitioning Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Fuzzy Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Fuzzy c-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Other fuzzy clustering algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Validation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Partition Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.2 Partition Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.3 Partition Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.4 Separation Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.5 Xie-Beni Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.6 Other validation measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Knowledge Discovery in Databases 13
3.1 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Fuzzy modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2 Model assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.3 Model layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Preprocessing of MIMIC II Database 21
4.1 MIMIC II Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Vasopressors subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5 Fluid subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5 Clustering of MIMIC II Database 27
5.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.1 Full dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.1.2 First data reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.1.3 Clustering with high frequency data . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 Cluster Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.1 Clusters obtained . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.2 Cluster centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2.3 Main features histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2.4 Demographics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.5 Clusters by pathology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6 Results 37
6.1 Single-model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2 Multi-model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.3 Best results comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7 Conclusions 39
A Data Partitioning - GK Results 41
A.1 All data All features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
A.2 All Features Last point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
A.3 5Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.4 LastRecords5Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
B Cluster Histograms 45
B.1 Physiological variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
B.2 Variables at ICU entrance/exit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.3 Output variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.4 Cluster distribution by pathology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
C Projections 53
References 58
List of Tables
4.1 Physiological variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Static variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Chosen data features with the low sample times. . . . . . . . . . . . . . . . . . . . . . . . 24
5.1 Number of patients and features mean values for each cluster. . . . . . . . . . . . . . . . 32
6.1 Single-model results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2 Multi-model results after optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.3 Single-model and multi-model results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
B.1 Mean of physiological variables by cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
List of Figures
1.1 The interrelationship between SIRS, sepsis and infection. . . . . . . . . . . . . . . . . . . 2
3.1 KDD processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Finding ROC point AUC maximum by summing areas of the trapezoid . . . . . . . . . . . 17
3.4 Multimodel scheme with decision a priori based on cluster centers. . . . . . . . . . . . . . 18
3.5 Multimodel scheme with decision a posteriori. . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 Classifier values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1 Patient records frequency plot of the first 100 entries. . . . . . . . . . . . . . . . . . . . . . 25
5.1 Validation measures for a set of 20 runs for each number of clusters between 2 and 20
(mean and variance) using the full feature set. . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Validation measures for FCM clustering of the last three records for every patient and the
five most frequently sampled features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 Validation measures for the reduced set, and most sampled variables . . . . . . . . . . . 31
5.4 Patients in cluster and vasopressor distribution. . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 Most frequently sampled features histograms. . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.6 Patient distribution per cluster by demographics . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7 Patient distribution per cluster by pathology. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
A.1 Validation measures for the GK clustering of the full data set. . . . . . . . . . . . . . . . . 41
A.2 Validation measures using the GK clustering for the last 3 records of each patient and all
features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.3 Validation measures for GK Clustering of last 3 records of each patient and the 5 feature
subset data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.4 Validation measures for GK Clustering of the 5 feature subset data and mean of the last
three records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
B.1 Heart rate distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.2 Temperature distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.3 SpO2 distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.4 Respiratory rate distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.5 GCS total distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.6 Braden score distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
B.7 Hematocrit distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.8 Platelets distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.9 White blood cells (WBC) distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . 47
B.10 Hemoglobin distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.11 Red blood cells (RBC) distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . 47
B.12 Blood urea nitrogen (BUN) distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . 47
B.13 Creatinine distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.14 Glucose distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.15 Potassium distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.16 Chloride distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.17 Sodium distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.18 Magnesium distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.19 Non-invasive blood pressure (NBP) distribution per cluster. . . . . . . . . . . . . . . . . . 49
B.20 NBP mean distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.21 Arterial pH distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.22 Arterial base excess distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.23 Lactic Acid distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
B.24 Urine output distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
B.25 Age distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.26 Sex distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.27 Mortality distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.28 SOFA score distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B.29 Vasopressors administration distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
B.30 Pneumonia patients distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . 52
B.31 Pancreatitis patient distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . 52
B.32 SIRS patient distribution per cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
C.1 projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Nomenclature
Abbreviations
AUC Area under curve.
BUN Blood urea nitrogen.
CE Classification Entropy.
FCM Fuzzy c-means algorithm.
GK Gustafson-Kessel algorithm.
ICU Intensive care unit.
KDD Knowledge discovery in databases.
NBP Non-invasive blood pressure.
PC Partition Coefficient.
RBC Red blood cells.
ROC Receiver operating characteristic curve.
SAPS Simplified Acute Physiology Score.
SIRS Systemic inflammatory response syndrome.
SOFA Sequential Organ Failure Assessment score.
WBC White blood cells.
XB Xie-Beni index.
Greek symbols
δi threshold of model i.
Roman symbols
C Number of clusters.
e Stopping criterion threshold.
Mi Model i.
Oi Output of model i.
S Separation index.
Sc Partition index.
Subscripts
f fuzzy.
h hard.
i, j, k, l Computational indexes.
Superscripts
c Cluster number.
m Fuzziness index.
Chapter 1
Introduction
The aim of this study is to address the prediction of the administration of vasopressors to septic shock patients in intensive care units (ICUs). In particular, our hypothesis is that a multi-model approach to the prediction of vasopressor need, based on fuzzy clustering, leads to improved performance compared to a one-model-fits-all approach.
First, a problem overview regarding the use of vasopressors in medical care is given, followed by a study of data mining techniques used in medical care, then a review of related work, and finally the specific contributions of this dissertation.
1.1 Problem overview
Shock is a life-threatening medical emergency that can be defined as "acute circulatory failure with inadequate or inappropriately distributed tissue perfusion resulting in generalized cellular hypoxia" [1]. This means that end cells are not receiving enough blood, which deprives them of needed oxygen and can in turn lead to tissue death and organ failure.
The maintenance of end-organ perfusion is critical to prevent irreversible organ injury and failure, and
this frequently requires the use of fluid resuscitation and vasopressors [2], which are medicines used to
contract blood vessels so as to increase blood pressure in critically ill patients.
Unlike most clinical conditions, for which a clinical diagnosis is made before treatment is initiated, the treatment of shock often occurs at the same time as, or even before, the diagnostic process [2]. Thus it is possible for patients who would not require the medication to have it administered anyway.
There is also the added problem that the administration procedure itself is risky: when done urgently, it can lead to infections, ultimately increasing costs.
If it can be predicted beforehand which patients are going to need vasopressors, costs will be reduced, since fewer patients will receive the treatment unnecessarily, and catheter surgery can be performed in a more timely fashion, making it less prone to medical complications.
[Figure: Venn diagram relating infection (bacteremia, fungemia, viremia, other), sepsis, and SIRS (with non-infectious causes: pancreatitis, burns, trauma, other).]
Figure 1.1: The interrelationship between SIRS, sepsis and infection.
Figure 1.1 shows the interrelationship between systemic inflammatory response syndrome (SIRS), sepsis, and infection. Essentially, there is a condition marked by a generalized inflammatory response of the body which, when caused by a blood-borne infection, is called sepsis. Depending on the source of infection, it can be called bacteremia (in the case of pneumonia, for example), fungemia, parasitemia, or viremia, among others. Nonetheless, this state of inflammatory response can also have other causes, such as burns, trauma, or the initial stage of pancreatitis [3], also a focus of this study. A common development of these conditions is shock, called septic shock when it develops from sepsis, and it is that condition that is addressed in this study for the prediction of vasopressor need.
1.2 Data mining in medical care
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [4].
Data mining, while a relatively young field, is not untried: it has been used intensively and extensively in other areas, in particular by financial institutions, for credit scoring and fraud detection; marketers, for direct marketing and cross-selling or up-selling; retailers, for market segmentation and store layout; and manufacturers, for quality control and maintenance scheduling [5].
In medical care, the adoption of data mining has been slower, owing to the limited availability of data caused by privacy and legal issues [6]. But this trend is changing: according to [5], "In health care, data mining is becoming increasingly popular, if not increasingly essential. Some driving factors are the existence of medical insurance fraud and abuse; the ever increasing volume of data generated by health care transactions, too complex to treat by traditional methods; financial pressures to increase operating efficiency while maintaining a high level of care; and the realization that data mining can generate information that is very useful for all parties involved in the industry, from health care insurers and providers to customers."
There are many applications of data mining in the field of health care. In [5], these applications are grouped into four distinct areas: evaluation of treatment effectiveness; management of health care; customer relationship management; and detection of fraud and abuse.
The limitations of health care data mining concern the availability and quality of the data. The data has to be collected and integrated before data mining is attempted. There can be missing, corrupted, inconsistent or non-standardized data due to different formats and sources. Data can also be unavailable due to ethical, legal and social issues, such as data ownership [5]. Additionally, the databases may be designed primarily for financial/billing purposes rather than medical/clinical ones, leading to lower-quality clinical data for data mining [6].
The success of health care data mining hinges on the availability of clean health care data. Possible future directions include the standardization of clinical vocabulary and the sharing of data across organizations [5].
1.3 Related work
One particular focus of the present work is the combination of multiple models for prediction.
One application of model combination is in weather forecasting, in particular for hydrological forecasts of rainfall-runoff models for water discharge prediction, such as in [7]. Initial studies have looked into combining the output of different models through various methods; the cited study uses a simple average method, a weighted average method and a neural network method. Forecast models have also been combined non-linearly using fuzzy rules, as in [8]. In [9], the authors address the question of whether multi-model combination with less skillful models can really enhance the prediction skill of ensemble forecasts, and also whether a multi-model can perform better than the best single model available, assuming that there is a 'best' model and that it can be identified.
Another use of multi-model combination is in hybrid control applications in the process industry, where models of different setups are used. One such application is in the field of fault diagnosis, as in [10]. There, a multi-model architecture is used in which a fuzzy decision-making approach isolates faults based on the analysis of the residuals: the differences between the system output and the outputs of the models identified with and without faults.
Also relevant is the machine learning field of ensemble learning, defined as the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem [11]. Ensemble learning is primarily used to improve the (classification, prediction, function approximation, etc.) performance of a model, or to reduce the likelihood of an unfortunate selection of a poor one. Random forest [12] is a representative algorithm, consisting of many decision trees that vote to select class membership.
A recent paper [13] addresses the task of combining classifiers, from the creation of ensembles to decision fusion; for further study, the book [14] on combining pattern classifiers, of which that paper is but a summary, is advised. More recently, a survey of decision fusion and feature fusion strategies for pattern classification became available in [15]. Finally, for further research, some related areas are the fields of data fusion, decision fusion, ensemble learning and clustering ensembles.
1.4 Contributions
This thesis follows the work done on the vasopressor data subset of MIMIC II, as compiled by André Fialho and indicated in [16]. In the present work the dataset is similar, but a different approach is taken. In particular, an alternative selection of features is presented, and these features are used for clustering patients, with a focus on visualizing the results.
Like [17], this work builds upon the notion of using a multi-model architecture and combining the models to create a better performing classification algorithm. Unlike that work, however, fuzzy clustering is used here to build the models, and different methodologies are presented to combine them.
In [6] it is noted that, in biomedicine, clustering is normally used for microarray data analysis rather than for general health care data analysis, since very little is known about genes, while more information is available about health conditions and disease symptoms; clustering, in particular, is often used when little or no information is available. The present study helps to show that there is a place for clustering in health care, either as an aid to modeling or used alone to gain new insights and confirm medical information.
Finally, following this work a paper was submitted and accepted for presentation at the 2013 IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE 2013) and for publication in the conference
proceedings published by IEEE.
Chapter 2
Clustering
Clustering is an unsupervised learning task that aims at decomposing a given set of objects into sub-
groups or clusters based on similarity. The goal is then to divide the dataset in such a way that objects
belonging to the same cluster are as similar as possible, whereas objects belonging to different clusters
are as dissimilar as possible [18].
Cluster analysis is primarily a tool for discovering previously hidden structure in a set of unordered
objects. In this case one assumes that a ‘true’ or natural grouping exists in the data. However, the as-
signment of objects to the classes and the description of these classes are unknown [18]. The degree of similarity is obtained through distance functions that are part of the clustering methods and that measure the dissimilarity of the presented example cases.
Traditionally, clustering techniques have been divided into two main groups: hierarchical and partitioning [19], though more groupings can be stated, such as density-based or grid-based clustering, for example.
2.1 Hierarchical Clustering
Hierarchical techniques organize data in a nested sequence of groups, which can be visualized in the
form of a dendrogram or tree. Based on a dendrogram one can decide on the number of clusters at
which the data are best represented for a given purpose [18].
These methods can be further divided into agglomerative and divisive methods, depending on whether a bottom-up sequential aggregation of data points into a tree is made, or a top-down division of the data into several subsets is performed instead.
The most used algorithms for hierarchical clustering are known as single linkage (also known as nearest neighbor) and complete linkage (furthest distance), based on the distance function used.
It is necessary to note that, in this type of clustering, once a data point is assigned to a given cluster, it remains there until the end, making the method more sensitive to outliers and initial conditions.
2.2 Partitioning Clustering
Another clustering type is partitioning clustering. Given a positive integer c, these algorithms aim at
finding the best partition of the data into c groups based on the given dissimilarity measure and they
regard the space of possible partitions into c subsets only [18].
One of the best known partitioning clustering methods is K-means. Many other methods were designed based on variations of parts of this algorithm, such as K-medoids, which uses points from the set as centers (medoids), or K-medians, which uses medians instead of means.
K-means [20] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. It aims at minimizing an objective function (Eq. 2.1), where $d_{ij}$ is the distance function (for standard K-means, the Euclidean distance) and $u_{ij}$ the partition matrix:
$$J_h(X, U_h, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}\, d_{ij}^2 \qquad (2.1)$$
K-means is also known as 'hard' c-means, in light of the fuzzy c-means clustering algorithm described below: in classical (hard) cluster analysis, each datum is assigned to exactly one cluster.
2.3 Fuzzy Clustering
If we relax the requirement $u_{ij} \in \{0, 1\}$ placed on the cluster assignment in hard partitioning approaches, so as to allow gradual memberships of data points measured as degrees in $[0, 1]$, we enable a data point to belong to more than one cluster. The concept of these membership degrees is built upon the notion of fuzzy sets as introduced in [21].
2.3.1 Fuzzy c-means
Fuzzy partitioning is carried out through an iterative optimization of the objective function (Eq. 2.2), with the update of the memberships $u_{ij}$ and the cluster centers $c_j$:
$$J_f(X, U_f, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^m\, d_{ij}^2 \qquad (2.2)$$
where $u_{ij}$ is the partition matrix, $m > 1$ is the fuzziness exponent, and $d_{ij}$ is a distance function (for standard FCM, the Euclidean distance).
The problem is then the optimization of the objective function $J_f$ (Eq. 2.2). This method was first described in [22] with m = 2, and later generalized in [23], with the update formulas in their current form:
$$u_{ij} = \frac{1}{\sum_{l=1}^{c} \left( d_{ij}^2 / d_{lj}^2 \right)^{\frac{1}{m-1}}} \qquad (2.3)$$

$$c_i = \frac{\sum_{j=1}^{n} (u_{ij})^m x_j}{\sum_{j=1}^{n} (u_{ij})^m} \qquad (2.4)$$
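As an illustration of the iteration defined by (2.2)-(2.4), the following is a minimal NumPy sketch that alternates the center and membership updates until the largest change in the partition matrix falls below a stopping threshold. The function and variable names are ours, for illustration only; this is not the actual implementation used in this work.

```python
import numpy as np

def fcm(X, c, m=1.8, tol=0.01, max_iter=200, seed=0):
    """Minimal fuzzy c-means per Eqs. (2.2)-(2.4); X is (n_points, n_features)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # columns of U sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)   # Eq. (2.4)
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)             # guard against zero distances
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)       # Eq. (2.3)
        if np.abs(U_new - U).max() < tol:   # stopping criterion e
            return centers, U_new
        U = U_new
    return centers, U
```

With the parameters adopted later in this work, a call would look like `centers, U = fcm(data, c=4, m=1.8, tol=0.01)`.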
2.3.2 Other fuzzy clustering algorithms
Like K-means, there are many possible variations, which lead to other algorithms that can take advantage of particular cases known in advance, such as the presence of given shapes (ellipsoids, lines) or noise.
One notable variation is the Gustafson-Kessel (GK) clustering algorithm, where the distance function is changed to the Mahalanobis distance in order to detect clusters of different size and orientation. This allows it to extract more information, but makes the algorithm more sensitive to initialization and computationally more demanding.
Also of note is kernel-based fuzzy clustering, where the distance function is further modified to handle non-vectorial data such as sequences, trees or graphs.
2.4 Validation Measures
Usually the number of (true) clusters in the given data is unknown in advance. However, when using partitioning methods one is usually required to specify the number of clusters c as an input parameter. Estimating the actual number of clusters is thus an important issue [18].
The name given to these criteria is not consistent in the literature: they are called validation measures, validity criteria, evaluation measures, or validity indices. As in [24], the term used in this work is validation measures.
Such measures can be used to evaluate the clustering quality quantitatively and to compare algorithms with one another. They can also be applied to compare the results obtained with a single algorithm when the parameter values are changed. In particular, they can be used to select the optimal number of clusters: applying the algorithm for several values of c, the value c* leading to the optimal decomposition according to the considered criterion is selected [18].
External clustering validation and internal clustering validation are the two main categories of clustering validation; the main difference is whether or not external information is used [25]. Internal validation measures rely only on information in the data, and are therefore applicable to situations where there is no previous knowledge, such as a true number of clusters or previously known classes. This makes them more suitable for an exploratory knowledge discovery process such as the one in this study, and less database-specific.
In the literature, a number of internal clustering validation measures for crisp clustering have been proposed, such as the Dunn index [22], the silhouette index [26] or the Davies-Bouldin index [27]. But starting with Bezdek in 1975 [23], other measures have been proposed specifically for fuzzy clustering, using information about the partition matrix and other fuzzy clustering parameters. Moreover, using a validity measure intended for crisp clustering on fuzzy clustering results would make them dependent on some kind of defuzzification scheme.
For these reasons, we focused on internal fuzzy validation measures alone, some of which are now presented:
2.4.1 Partition Coefficient
The partition coefficient measures the amount of "overlap" between clusters. It is defined by Bezdek as follows [23]:
$$PC = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^2 \qquad (2.5)$$
2.4.2 Partition Entropy
The partition entropy validation measure computes the entropy of the obtained membership degrees and must be minimized. Like the partition coefficient, it measures only the fuzziness of the cluster partition, and is defined as [23]:
$$PE = -\frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} \log u_{ij} \qquad (2.6)$$
2.4.3 Partition Index
The partition index is the ratio of the sum of compactness and separation of the clusters. It is a sum of individual cluster validity measures, normalized through division by the fuzzy cardinality of each cluster [28]:
$$S_c(c) = \sum_{i=1}^{c} \frac{\sum_{j=1}^{N} (\mu_{ij})^m \, \|x_j - v_i\|^2}{N_i \sum_{k=1}^{c} \|v_k - v_i\|^2} \qquad (2.7)$$
2.4.4 Separation Index
In contrast to the partition index Sc, the separation index uses a minimum-distance separation for partition validity [28]:
$$S(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (\mu_{ij})^2 \, \|x_j - v_i\|^2}{N \min_{i,k} \|v_k - v_i\|^2} \qquad (2.8)$$
2.4.5 Xie-Beni Index
Also designed for fuzzy clustering, and in widespread use, is the Xie-Beni index, which quantifies the ratio of the total variation within clusters to the separation of the clusters [29]:
$$XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (\mu_{ij})^m \, \|x_j - v_i\|^2}{N \min_{i,j} \|x_j - v_i\|^2} \qquad (2.9)$$
A better clustering is obtained by minimizing XB(c). However, it has been shown that the Xie-Beni index has a problem for high numbers of clusters, where its behavior becomes monotonically decreasing. This can be addressed by calculating one of the corrected Xie-Beni indices available in the literature; in practice, however, only a rather small number of clusters is usually sought, and so the uncorrected Xie-Beni index is still used, given its widespread adoption.
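For concreteness, the measures most used later can be computed from the clustering output as in the sketch below, where U is the c-by-n partition matrix and centers the c cluster prototypes returned by the FCM sketch in section 2.3.1; the Xie-Beni denominator follows Eq. (2.9) exactly as printed above.

```python
import numpy as np

def partition_coefficient(U):
    """Eq. (2.5): PC in [1/c, 1]; higher means a crisper partition."""
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U, eps=1e-12):
    """Eq. (2.6): entropy of the memberships, to be minimized."""
    return -(U * np.log(U + eps)).sum() / U.shape[1]

def xie_beni(X, centers, U, m=1.8):
    """Eq. (2.9) as printed above: within-cluster variation over N times
    the minimum squared point-to-center distance."""
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    return ((U ** m) * d2).sum() / (X.shape[0] * d2.min())
```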
2.4.6 Other validation measures
Other possible validation measures, not used in this study but common enough to at least be named, are the Fukuyama-Sugeno index [30] and Gath and Geva's fuzzy hypervolume and partition density [31]. The last two, in particular, are better suited to GK clustering, since they require the calculation of the covariance matrix, which is already part of the GK clustering algorithm.
Chapter 3
Knowledge Discovery in Databases
Knowledge discovery in databases (KDD) is the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data [32].
Figure 3.1: KDD processes. [32] [33]
According to Fayyad [32], there are essentially five steps in the KDD process: selection, preprocessing, transformation, data mining and interpretation (also called evaluation). From an engineering (systems modeling) background, these steps could be called data acquisition, data preprocessing, feature selection, modeling and interpretation without significant loss of meaning, as in [33] and shown in Figure 3.1. The overall process of finding and interpreting patterns in data involves the repeated application of these steps, now explained in more detail:
1. Data acquisition / selection - Comprises the creation or selection of a target dataset, or of particular variables on which to perform data discovery. It is an important task, as the quality of the data will impact the quality of the data discovery process.
2. Data preprocessing - This step focuses on the cleaning or preprocessing of the data. In particular, it deals with the removal of noise or outliers, noise modeling, handling missing data fields, and accounting for time sequence information.
3. Feature selection / Transformation - Includes data reduction and projection, so as to represent the data in a form more amenable to data mining. It tries, on one hand, to select a smaller set of variables that represent the data while, on the other, eliminating redundancy that can impair the data mining process.
4. Modeling / Data mining - In this step, a data mining method is selected depending on the goal of the analysis. Six common tasks are:
• Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
• Association rule learning (Dependency modeling) – Searches for relationships between vari-
ables, sometimes referred to as market basket analysis.
• Clustering – is the task of discovering groups and structures in the data that are in some way
or another ”similar”, without using known structures in the data.
• Classification – is the task of generalizing known structure to apply to new data.
• Regression – Attempts to find a function which models the data with the least error.
• Summarization – providing a more compact representation of the data set, including visualization and report generation.
5. Interpretation - In this step, the patterns obtained through data mining are evaluated to see whether they are interesting or not; thus it can also be called evaluation. The aim is to represent the results in an appropriate way so that they can be examined thoroughly. If a pattern is not interesting, the cause has to be found, and further attempts can be made, redoing some of the previous steps.
3.1 Modeling
The choice of the modeling technique to be used may depend on many factors, including the source of the data set and the values it contains. Methods based on fuzzy systems inherit model transparency and enjoy good function approximation properties. For these reasons, fuzzy modeling was used in this work, along with fuzzy clustering.
3.1.1 Fuzzy modeling
Fuzzy modeling is a tool that allows approximation of nonlinear systems when there is little or no previous
knowledge of the problem to be modeled [34].
This approach provides a transparent, non-crisp model, and also makes a linguistic interpretation possible, in the form of rules and logical connectives. These are used to establish relations between the defined features in order to derive a model.
A fuzzy classifier contains a rule base consisting of a set of fuzzy if-then rules, together with a fuzzy inference mechanism. These models ultimately classify each instance of the dataset as belonging, with a certain degree, to one of the possible classes defined for the specific problem being modeled.
As suggested in [35], a discriminant method was used in this work, where the classification is based on
the largest discriminant function. In this method, a separate discriminant function dc(x) is associated
with each class wc, with c = 1, ..., C. The discriminant functions can be implemented as fuzzy inference
systems. Here, we use Takagi-Sugeno (TS) fuzzy models [36] with which, each discriminant function
consists of rules of the type:
Rule $R_i^c$: If $x_1$ is $A_{i1}$ and ... and $x_M$ is $A_{iM}$ then $d_c(x) = f_i^c$, $\;\; i = 1, 2, \ldots, K$,

where $f_i^c$ is the consequent function for rule $R_i^c$. In these K rules, the index c indicates that the rule is associated with output class c. Note that both the antecedent parts of the rules and the consequents can differ between discriminants. The classifier assigns the class label corresponding to the maximum value of the discriminant functions, i.e.

$$\max_c d_c(x). \qquad (3.1)$$
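A small sketch of such a discriminant-based TS classifier follows. The thesis fixes only the rule structure above; the Gaussian antecedent membership functions and affine consequents used here are common TS choices and should be read as assumptions, as should all names.

```python
import numpy as np

def ts_discriminant(x, rules):
    """d_c(x) for one class: the firing-strength-weighted mean of the
    rule consequents (Gaussian antecedents and affine consequents are
    assumptions; the text only specifies the TS rule structure)."""
    w, f = np.empty(len(rules)), np.empty(len(rules))
    for i, r in enumerate(rules):
        # antecedent: product ('and') of Gaussian membership degrees
        mu = np.exp(-0.5 * ((x - r["centers"]) / r["sigmas"]) ** 2)
        w[i] = mu.prod()
        # consequent f_i^c(x), taken here as affine: a . x + b
        f[i] = r["a"] @ x + r["b"]
    return (w * f).sum() / (w.sum() + 1e-12)

def ts_classify(x, rule_bases):
    """Assign the class with the largest discriminant, Eq. (3.1)."""
    return int(np.argmax([ts_discriminant(x, rb) for rb in rule_bases]))
```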
3.1.2 Model assessment
In describing the performance of binary classifiers, the accuracy of classification cannot be considered alone [37]; both the sensitivity, or hit rate, and the specificity, or true rejection rate, must also be analyzed. In medical diagnosis and in the machine learning community, one of the methods for combining these two measures into the evaluation is the analysis of the area under the ROC curve (AUC), the ROC being the receiver operating characteristic curve, a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. ROC curves allow a visualization of the trade-off between the hit rates and false alarm rates of classifiers [38].
In this work, the measures used to assess the quality of the obtained classifier were the AUC, specificity
(3.2), sensitivity (3.3) and accuracy (3.4), which can be calculated as:
$$\text{specificity} = 1 - \text{FP rate} \qquad (3.2)$$

$$\text{sensitivity} = \text{TP rate} \qquad (3.3)$$

$$\text{accuracy} = \frac{TP + TN}{P + N} \qquad (3.4)$$

where

$$\text{FP rate} = \frac{FP}{N} \qquad (3.5)$$

$$\text{TP rate} = \frac{TP}{P} \qquad (3.6)$$

and P are positives, N negatives, TP true positives, TN true negatives, and FP false positives.
All of these are entries of a matrix widely used in pattern recognition, called the confusion matrix, which is used to represent errors in assigning classes to observed patterns: its ij-th element represents the number of samples from class i which were classified as class j.
Sensitivity and specificity are statistical measures of the performance of a binary classification test, also
known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall
rate in some fields) measures the proportion of actual positives which are correctly identified as such
(e.g. the percentage of sick people who are correctly identified as having the condition). Specificity mea-
sures the proportion of negatives which are correctly identified as such (e.g. the percentage of healthy
people who are correctly identified as not having the condition, sometimes called the true negative rate).
These two measures are closely related to the concepts of type I and type II errors. A perfect predictor
would be described as 100% sensitive (i.e. predicting all people from the sick group as sick) and 100%
specific (i.e. not predicting anyone from the healthy group as sick).
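These relations translate directly into code; a minimal sketch from confusion-matrix counts:

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix
    counts, following Eqs. (3.2)-(3.6)."""
    P, N = tp + fn, tn + fp              # actual positives / negatives
    sensitivity = tp / P                 # TP rate, Eqs. (3.3), (3.6)
    specificity = 1.0 - fp / N           # 1 - FP rate, Eqs. (3.2), (3.5)
    accuracy = (tp + tn) / (P + N)       # Eq. (3.4)
    return sensitivity, specificity, accuracy
```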
Optimization of the AUC
In [16], the optimized discriminating threshold is obtained through maximization of the AUC. The AUC for each value of the threshold is calculated through a trapezoidal approximation, by adding the areas of the triangles and square shown in Figure 3.3, where an example ROC curve is shown together with the calculation of the AUC at its maximum. A good AUC value is therefore one with a large TP rate and a small FP rate, thus maximizing sensitivity and specificity.
Another option would be to select different weights for specificity and sensitivity, as used in [17]. It was found that giving equal weight to specificity and sensitivity led to the same results.
[Figure: 2×2 confusion matrix layout — hypothesized class (Y/N) versus true class (p/n), with cells True Positives, False Negatives, False Positives, True Negatives, and column totals P and N.]
Figure 3.2: Confusion matrix.
[Figure, two panels with axes FP rate vs. TP rate: (a) ROC curve; (b) AUC approximation by the areas 1, 2 and 3.]
Figure 3.3: Finding ROC point AUC maximum by summing areas of the trapezoid: 1, 2 and 3.
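One plausible reading of the decomposition in Figure 3.3 is the standard one-point trapezoid, under which the AUC of a single operating point reduces to (sensitivity + specificity)/2; the following sketch uses it to scan candidate thresholds (all names, and the threshold grid, are illustrative):

```python
import numpy as np

def single_point_auc(fp_rate, tp_rate):
    """Area under the one-point ROC curve (0,0) -> (FPR,TPR) -> (1,1):
    triangle + rectangle + triangle, i.e. (sensitivity + specificity)/2."""
    return (0.5 * fp_rate * tp_rate                      # area left of point
            + (1.0 - fp_rate) * tp_rate                  # rectangle under point
            + 0.5 * (1.0 - fp_rate) * (1.0 - tp_rate))   # triangle to (1,1)

def best_threshold(outputs, labels, thresholds):
    """Pick the discrimination threshold maximizing the AUC above;
    'outputs' are real-valued classifier outputs, 'labels' are 0/1."""
    outputs, labels = np.asarray(outputs), np.asarray(labels)
    P, N = (labels == 1).sum(), (labels == 0).sum()
    def auc_at(t):
        pred = outputs >= t
        return single_point_auc((pred & (labels == 0)).sum() / N,
                                (pred & (labels == 1)).sum() / P)
    return max(thresholds, key=auc_at)
```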
3.1.3 Model layouts
When modeling a system, one can opt for a one-model-fits-all approach, where a single general model is built with all available training data, or for a multi-model solution. In the latter, there is the additional problem of selecting which model to use for each data point. This work proposes using the division of the data obtained from unsupervised fuzzy clustering to derive the multi-models; thus, the number of models is equal to the number of clusters.
The multi-model approach is compared to a single model, which was derived using the whole dataset. The model that maximized the AUC was chosen as the best one, in order to balance specificity and sensitivity.
A priori decision
The a priori decision scheme is based on cluster similarity. The criterion used for the choice of the model was the distance of each point to the cluster centers: the output of the model associated with the cluster the point is closest to is the one passed on, as shown in Figure 3.4, where M1, M2, ..., and Mn are the models. In this case, 4 clusters were used, and so we ended up with 4 models.
Figure 3.4: Multimodel scheme with decision a priori based on cluster centers.
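A sketch of this a priori routing, assuming models is a list of callables (one trained model per cluster) and centers the cluster prototypes; the names are ours:

```python
import numpy as np

def a_priori_predict(x, centers, models):
    """Route each sample to the model of its nearest cluster center."""
    d2 = ((centers - x) ** 2).sum(axis=1)   # squared distance to each center
    return models[int(np.argmin(d2))](x)    # only that model is evaluated
```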
A posteriori decision
In this scheme, the multi-model approach implements an a posteriori decision. Figure 3.5 shows the
proposed layout.
Figure 3.5: Multimodel scheme with decision a posteriori.
The decision is given by the model that has the largest difference between its output Oi and a threshold δi; see (3.7). This threshold is optimized for each model in order to maximize the AUC. The approach is based on the hypothesis that an output further from the threshold is more accurate, as there is less uncertainty.
Figure 3.6 shows the values taken by the output of the model prior to classification. One can see that the values are not 0 and 1, but range from -0.6 to 0.8. A threshold has to be applied to turn this real value into a binary one; the value chosen was the one that maximized the AUC.
[Figure, two panels: (a) range of classifier values, roughly -0.6 to 0.8 across the samples; (b) classifier threshold.]
Figure 3.6: Classifier values.
Later, to choose which model to use, a similar idea was applied. After optimization, the value chosen was 0.17 (Figure 3.6 b)). The criterion was based on the distance to the threshold (3.7), choosing at each point the model whose output is furthest from the threshold at which the class changes,

$$\max_i |O_i - \delta_i| \qquad (3.7)$$

where the threshold used was δi = 0.17, with Oi being the output of model i.

Another criterion tested was based on the distance from each output to the extremes of classification, [0, 1], choosing the model for which the difference between its output and the nearest extreme was smallest.
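A sketch of the a posteriori criterion (3.7), under the same assumptions as before, with deltas holding the optimized per-model thresholds (e.g. np.full(4, 0.17) for the value reported above):

```python
import numpy as np

def a_posteriori_predict(x, models, deltas):
    """Evaluate every model and keep the one whose output lies furthest
    from its own threshold, Eq. (3.7); return its binary decision."""
    outputs = np.array([m(x) for m in models])
    i = int(np.argmax(np.abs(outputs - deltas)))
    return int(outputs[i] >= deltas[i])
```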
Chapter 4
Preprocessing of MIMIC II Database
One of the essential parts of data mining is access to a database. In real life there can be missing data and other obstacles that one must tackle before fully using a dataset.
In this work we use data from a medical database known as MIMIC II, which is introduced below. An overview of the actual subset used is given, along with information regarding the preprocessing and feature selection that were required before using the data.
4.1 MIMIC II Database
The MIMIC II (Multi-parameter Intelligent Monitoring in Intensive Care) Clinical Database contains detailed information on more than 25,000 intensive care unit patients. It was initially composed of data from adult patients admitted to the ICUs of Boston's Beth Israel Deaconess Medical Center, an academic medical center with 620 beds, 77 of which are for critical care, during the period 2001-2007 [39].
It is composed of two parts. The first is the MIMIC II Waveform Database, which includes bedside monitor trends and waveforms and is freely available. The second is the MIMIC II Clinical Database, which includes all other elements of MIMIC II, such as patient demographics, physiological measurements, lists of procedures, medications, lab tests, fluid balance, notes and staff reports; it is available to qualified researchers who obtain human subjects training, under the terms of a data use agreement concerning issues of human research and privacy. This information can be queried or downloaded from the database website.1
Preprocessing was undertaken to improve data quality. Missing data was imputed consistently with the accepted last value carried forward method [40, 41, 42]. Outliers were addressed using the inter-quartile range method [43]. Normalization of the data used the min-max procedure. Finally, the data was aligned with a gridding approach based on the heart rate sampling frequency [44].

1 http://physionet.org/mimic2/
4.2 Vasopressors subset
In clinical practice, it is common to attribute to each patient a series of ICD-9 codes. This is a medical coding in which every health condition (sign, symptom or disease) is assigned a unique code, grouped by similar conditions. These codes were used to select two specific groups of patients: pancreatitis2 and pneumonia3. These are two conditions prone to the development of systemic shock that may end up requiring the use of fluids and vasopressor agents.
For the selection of patients, a set of variables usually obtained in the ICU by non-invasive means was chosen, along with the times (in hours) between samples, as indicated in Table 4.1.
 #   Variable (units)                              95% CI for time between samples (hours)
 1   Heart Rate (beats/min)                        0.85 - 0.86
 2   Temperature (C)                               2.64 - 2.68
 3   SpO2 (%)                                      0.85 - 0.87
 4   Respiratory Rate (breaths/min)                0.90 - 0.91
 5   GCS Total                                     3.30 - 3.33
 6   Braden Score                                  7.28 - 7.46
 7   Hematocrit (%)                                13.0 - 13.5
 8   Platelets (cells/L)                           16.6 - 17.0
 9   WBC - White Blood Cells (10^3/mL)             17.3 - 17.8
10   Hemoglobin (g/L)                              16.9 - 17.5
11   RBC - Red Blood Cells (10^6/mL)               17.4 - 17.9
12   BUN - Blood Urea Nitrogen (mg/dL)             14.7 - 15.2
13   Creatinine (mg/dL)                            14.7 - 15.1
14   Glucose (mg/dL)                               8.7 - 9.0
15   Potassium (mEq/L)                             10.6 - 10.9
16   Chloride (mEq/L)                              14.7 - 15.1
17   Sodium (mEq/L)                                13.8 - 14.3
18   Magnesium (mg/dL)                             14.1 - 14.6
19   NBP - Non-invasive Blood Pressure (mmHg)      1.13 - 1.16
20   NBP Mean (mmHg)                               1.14 - 1.17
21   Arterial pH                                   6.45 - 6.73
22   Arterial Base Excess (mEq/L)                  6.57 - 6.86
23   Lactic Acid (mg/dL)                           13.0 - 14.46
24   Urine Output (mL)                             1.17 - 1.19
Table 4.1: Physiological variables.
A series of variables related to the patient information obtained on ICU admission was also grouped, as indicated in Table 4.2.
2 Pancreatitis ICD-9 codes: 577.0; 577.1; 577.2; 577.8; 577.9; 579.4
3 Pneumonia ICD-9 codes: 003.22; 020.3; 020.4; 020.5; 021.2; 022.1; 031.0; 039.1; 052.1; 055.1; 073.0; 083.0; 112.4; 114.0; 114.4; 114.5; 115.05; 115.15; 115.95; 130.4; 136.3; 480.0; 480.1; 480.2; 480.3; 480.8; 480.9; 481; 482.0; 482.1; 482.2; 482.3; 482.30; 482.31; 482.32; 482.39; 482.4; 482.40; 482.41; 482.42; 482.49; 482.8; 482.81; 482.82; 482.83; 482.84; 482.89; 482.9; 483; 483.0; 483.1; 483.8; 484.1; 484.3; 484.5; 484.6; 484.7; 484.8; 485; 486; 513.0; 517.1
 #   Variable                       Remark
 1   Patient ID
 2   Age at ICU admission
 3   Sex                            0 if female, 1 if male
 4   Mortality                      1 if patient died while in the ICU
 5   Hospital time stay
 6   ICU time stay
 7   SAPS score at ICU admission
 8   SOFA score at ICU admission
 9   Vasopressor administration     1 if it was administered
Table 4.2: Static variables.
For these records there was also a binary variable recording whether, at each instant, a given patient had vasopressors being administered or not. This served as the output variable for the prediction of vasopressor need.
4.3 Preprocessing
As with any real database, a few steps had to be taken before using the data, in particular due to the presence of missing data, outliers, and the need for data synchronization.
In order to tackle differences in collection frequency, the heart rate signal was used as a template variable to align the remaining variables, since it was the most frequently measured variable. This process is presented in more detail in [44]. In particular, the values were interpolated and the points in sync with the template variable were chosen.
Regarding missing data, the chosen procedure was to impute recoverable missing segments by cubic
interpolation.
Additionally, in the de-identification step of the MIMIC II Database, patients whose age was higher than 90 had their age set to 200, as a visible outlier. We then changed these to 92, so as not to weigh too heavily on the data mining processes.
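Putting the preprocessing steps of this section and of section 4.1 together, a pandas sketch might look as follows; the column names, the IQR fence factor of 1.5, and the interpolation gap limit are our assumptions, not values from the thesis:

```python
import pandas as pd

def preprocess(df):
    """Sketch of the preprocessing steps described above."""
    df = df.copy()
    # de-identified ages (coded 200 for age > 90) mapped back to 92
    df.loc[df["age"] > 90, "age"] = 92
    phys = df.columns.drop("age")
    # inter-quartile range outlier removal: values outside the fences -> NaN
    q1, q3 = df[phys].quantile(0.25), df[phys].quantile(0.75)
    iqr = q3 - q1
    df[phys] = df[phys].where((df[phys] >= q1 - 1.5 * iqr)
                              & (df[phys] <= q3 + 1.5 * iqr))
    # recoverable gaps by cubic interpolation (needs scipy), the rest by
    # last value carried forward
    df[phys] = df[phys].interpolate(method="cubic", limit=3).ffill()
    # min-max normalization to [0, 1]
    df[phys] = (df[phys] - df[phys].min()) / (df[phys].max() - df[phys].min())
    return df
```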
4.4 Feature Selection
Normally, for the task of data mining, one focuses on a subgroup of variables rather than the complete set. This can be for various reasons: too many variables can lead to redundancy, which can lower the prediction performance; it can also be for computational purposes, as fewer variables decrease computation time, and there is often quite a lot of data in data mining applications.
To select which variables to discard in feature selection, one could apply a process to discover which variables, separately or in groups, have the highest predictive power, and so avoid removing the most important ones and in turn diminishing the prediction rate. In [16], several techniques were used to select these variables on a similar dataset for the problem of vasopressor prediction. In particular, both bottom-up and top-down tree search approaches and an ant colony optimization method were used, without significant loss of performance.
In this work, this step of selecting an optimal subset of features was not performed. We focused instead on the particular subset of the 5 most frequently sampled variables, shown in Table 4.3. This was also the variable subset used for the clustering process.
ID   Variable (units)
 1   Heart Rate (beats/min)
 3   SpO2 (%)
 4   Respiratory Rate (breaths/min)
19   NBP - Non-invasive blood pressure (mmHg)
24   Urine Output (mL)
Table 4.3: Chosen data features, with the lowest sample times.
Since the preprocessing stage required all variables to be synchronized to the heart rate [44], in the features where the sample time was largest, and which were thus less frequently sampled, some of the records were obtained through interpolation and so were more prone to errors. The chosen features, in contrast, had sample times similar to the template variable, so their values were less influenced by the interpolation process.
There were initially 1489 patients, but a significant number of patients with too few records was found, as can be seen in Figure 4.1. Those with fewer than three records were therefore removed, leaving 1220 patients, 80 of them with pancreatitis and 430 with pneumonia.
[Figure: histogram of the number of samples per patient (# of samples vs. # of occurrences).]
Figure 4.1: Patient records frequency plot of the first 100 entries.
4.5 Fluid subset
Another data set was made considering a selection of patients that were given fluids, one of the possible
treatments for shock and usually the step done before the administration of vasopressors. Thus the
problem would be the prediction of the use of vasopressor drugs on patients that were given fluids for
the treatment of shock. In this dataset there are more patients (2944 in total), less features are used and
a reduced number of vasopressors are included.
Chapter 5
Clustering of MIMIC II Database
Once the data mining method was selected, the steps toward the actual knowledge retrieval were taken, through the clustering of the MIMIC II vasopressors data subset.
This chapter is organized as follows. First, a pattern in the data giving an indication of the number of clusters the data could be partitioned into was searched for; to this end, a few validation measures were calculated over a sequence of cluster numbers, as discussed in section 2.4. Then the data was progressively reduced to a smaller selection that gave clearer results. Finally, the clustering solution was evaluated through the analysis of cluster histograms and cluster centers.
5.1 Clustering
In Chapter 2, clustering was introduced and the fuzzy c-means (FCM) clustering method was described in more detail. This was the method chosen, for the reasons indicated there: it is a standard clustering method, simple and easily extended. It makes sense to start with FCM and then choose a more advanced algorithm to tackle any specific hindrance, should obstacles arise.
Another clustering algorithm, Gustafson-Kessel (GK), was also used initially, and tests were carried out for all the mentioned validation measures. As there were no improvements compared to FCM, it was dropped in favor of the latter; these results are available in Appendix A for the interested reader. GK was also found to be computationally more expensive.
In all the tests, the FCM algorithm was used with the fuzziness parameter m = 1.8 and a minimum amount of improvement (stopping criterion) e = 0.01. In [45], a fuzziness parameter between 1.5 and 2.5 is advised. In this work, the value 2.0 was used initially but was changed to 1.8 because, in the initial steps, the cluster centers were found to be too near each other. The relative positions of the calculated centers can be seen in Appendix C.
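For concreteness, a minimal NumPy sketch of the FCM loop with the parameters above (m = 1.8, e = 0.01) follows. This is a generic textbook implementation, not the code actually used in this work, and it assumes the data matrix X (one row per sample) has already been normalized.

    import numpy as np

    def fcm(X, c, m=1.8, e=0.01, max_iter=100, seed=0):
        """Fuzzy c-means: returns cluster centers and membership matrix U."""
        rng = np.random.default_rng(seed)
        U = rng.random((c, X.shape[0]))
        U /= U.sum(axis=0)                          # memberships sum to 1
        for _ in range(max_iter):
            Um = U ** m
            centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
            d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
            d = np.fmax(d, 1e-10)                   # avoid division by zero
            U_new = d ** (-2.0 / (m - 1.0))
            U_new /= U_new.sum(axis=0)
            if np.abs(U_new - U).max() < e:         # minimum improvement reached
                return centers, U_new
            U = U_new
        return centers, U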
The FCM algorithm requires the user to indicate the number of clusters into which to partition the data. Therefore, for a series of runs, validation measures were computed over a range of numbers of clusters in order to find this number, as addressed in Section 2.4.
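As an illustration of this procedure, the sketch below computes the Xie-Beni index (compactness over separation; lower is better, following its standard definition) for each candidate number of clusters, averaging over a batch of 20 runs as done here. It reuses fcm and the normalized matrix X from the sketch above.

    def xie_beni(X, centers, U, m=1.8):
        d2 = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) ** 2
        compactness = ((U ** m) * d2).sum()
        cd2 = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2) ** 2
        np.fill_diagonal(cd2, np.inf)               # ignore self-distances
        return compactness / (X.shape[0] * cd2.min())

    for c in range(2, 11):                          # candidate cluster numbers
        xb = [xie_beni(X, *fcm(X, c, seed=s)) for s in range(20)]
        print(c, np.mean(xb), np.var(xb))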
5.1.1 Full dataset
First, using the complete dataset without any feature selection, clustering was attempted with all 24 features and all patient records. Validation measures were calculated over a batch of 20 runs for numbers of clusters ranging from 2 to 10.
Figure 5.1 shows the mean values of the validity measures partition coefficient, classification entropy, partition index, separation index, and Xie-Beni index over a batch of 20 runs for each number of clusters tested. The trends were monotonic: there was no clear minimum in the curves of the partition index, separation index, and Xie-Beni index, and no clear extremum in the partition coefficient or classification entropy curves.
Therefore, the results gave no clear indication of a number of groups in the data.
5.1.2 First data reduction
Work proceeded by reducing the dataset used for clustering, using a smaller subset of the whole data instead. For this goal, the mean of the last three records was taken for each patient with at least three recorded samples; this was one of the reasons why entries with fewer than three records were discarded initially, as stated in Section 4.4. This left one point per patient, so 1220 points were used instead of the whole dataset.
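A sketch of this reduction, continuing the pandas example of Section 4.4 and assuming a hypothetical timestamp column and the five variable names from Table 4.3:

    features = ["heart_rate", "spo2", "resp_rate", "nbp", "urine_output"]
    last3 = records.sort_values("timestamp").groupby("patient_id").tail(3)
    points = last3.groupby("patient_id")[features].mean().to_numpy()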
The results are shown in Figure 5.2, where it can be seen that the values were relatively high for low numbers of clusters, especially in Figure 5.2 c), though Figures 5.2 a) and b) could be enough to identify 5 as a possible candidate for the number of clusters.
From this point on, the validation measures partition coefficient and classification entropy were dropped, as they showed no notable differences in the plots, either before or after data reduction.
5.1.3 Clustering with high frequency data
Nonetheless, work proceeded by further reducing the dataset, this time by reducing the number of features used. In particular, they were reduced to those with the smallest sample times, as described in Table 4.3 of Section 4.4, each sampled approximately once every hour.
Figure 5.1: Validation measures (mean and variance) for a set of 20 runs for each number of clusters between 2 and 10, using the full feature set: (a) partition coefficient (PC), (b) classification entropy (CE), (c) partition index (Sc), (d) separation index (S), (e) Xie-Beni index (XB).
The explanation is the one given in Section 4.4: since the preprocessing stage required all variables to be synchronized to the heart rate [44], features with larger sample times had some of their records obtained through interpolation in the alignment process, and were therefore more prone to errors.
Figure 5.2: Validation measures for FCM clustering of the last three records of every patient and the five most frequently sampled features: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).

Again, the validity measures partition index, separation index, and Xie-Beni index were obtained, with the means and variances shown in Figure 5.3. This time there was a clear minimum in Figures 5.3 a) and b), though not quite so clear in c).
Figure 5.3: Validation measures (mean and variance) for the reduced set and the most frequently sampled variables: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
5.2 Cluster Evaluation
Now that a number of clusters was obtained, the next step was to find out which information was stored in these four clusters. For this goal, each point was assigned to the cluster where its fuzzy membership was highest.
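This hardening step is a simple argmax over the membership matrix U returned by the FCM sketch of Section 5.1; vaso is a hypothetical 0/1 array marking patients who received vasopressors.

    labels = U.argmax(axis=0)                       # hard cluster assignment
    for c in range(U.shape[0]):
        members = labels == c
        print("cluster", c + 1, ":", members.sum(), "patients,",
              round(vaso[members].mean(), 2), "fraction on vasopressors")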
5.2.1 Clusters obtained
In order to evaluate the clusters obtained, a more careful study was made. Figure 5.4 a) shows a histogram of the patients divided by cluster, where it can be seen that the patients are well divided among the clusters. Figure 5.4 b) shows that, despite the seemingly even distribution of patients, a careful look at which of them were administered vasopressors (our output variable) reveals that a higher number of patients with vasopressors is located in the fourth cluster and a remarkably lower number in the first, with average numbers in the second and third clusters.
Figure 5.4: Patients per cluster and vasopressor distribution: (a) patients in each cluster; (b) patients with and without vasopressors in each cluster.
5.2.2 Cluster centers
The centers of the clusters were calculated, as shown in Table 5.1. The static variables were analyzed first, i.e., those acquired at ICU entrance or exit that remained unchanged.
cluster  patients  age    sex{0,1}  mortality{0,1}  SOFA  vasopres.{0,1}
1        332       61.78  0.63      0.37            2.30  0.15
2        278       57.97  0.48      0.45            2.44  0.28
3        298       66.06  0.54      0.50            2.41  0.39
4        312       61.84  0.53      0.47            2.62  0.53
Table 5.1: Number of patients and features mean values for each cluster.
Again it can be seen that the clusters are evenly distributed in terms of patients, ages, and sexes. Cluster 3 had a slightly higher mortality, which could be explained by its slightly higher mean age and by the considerable proportion of patients administered vasopressors (about 40%). Cluster 4, the highest in terms of vasopressor administration, had the second highest mortality, followed by cluster 2 and finally cluster 1, with the lowest proportion of patients with vasopressors. This ordering is consistent with shock being a leading cause of death in an ICU.
5.2.3 Main features histograms
Turning instead to the relationship between cluster assignment and feature values in Figure 5.5, more insightful information can be retrieved. The mean of the last three values of each feature of each patient was obtained, and histograms describing the feature distribution by cluster were drawn.
In terms of heart rate, in Figure 5.5 a), there is a significant difference between clusters 2 and 3 and only a slight difference between clusters 1 and 4. It can be seen that patients with mean-to-high heart rate are assigned to cluster 2, while those with mean-to-low heart rate are assigned to cluster 3.
For the SpO2% feature, the largest difference is between clusters 2 and 4: the former with mean-low values, the latter with mean-high SpO2%, while clusters 1 and 3 show mean values.
Turning to the respiratory rate (breaths/min) feature, it can be seen that while patients with mean-high respiratory rates are assigned to clusters 1 and 2, cluster 4 is assigned patients with mean-low rates.
As for non-invasive blood pressure (NBP), cluster 1 has patients with mean-high NBP, while clusters 3 and 4 are assigned patients with mean-low NBP.
Finally, in Figure 5.5 e), we can see that low values of urine output are more common in the clusters with higher proportions of patients administered vasopressors, clusters 3 and 4, while in cluster 1, where there is a lower proportion of patients with vasopressors, the mean is around 0.5 with equal numbers of patients above and below.
The main findings from Figure 5.5 are:
i. Patients needing vasopressors are assigned, in decreasing order, to clusters 4, 3, and 2, whose common characteristic is a low urine output. This variable alone would thus be enough to indicate risk.
ii. Since cluster 4 has the highest number of patients with vasopressors, one can say that patients at risk of needing vasopressors have low urine output and low non-invasive blood pressure, along with a low respiratory rate and high SpO2%, independent of the heart rate.
iii. Cluster 1 is assigned most of the patients without need of vasopressors. Its characteristics are patients with high blood pressure and high respiratory rate, independent of heart rate or urine output. In contrast to the other clusters, a high urine output or NBP alone would be sufficient to assign a low probability of needing vasopressors.
iv. Cluster 3 patients exhibit a low heart rate but are nonetheless at high risk, so this variable alone is not sufficient to indicate shock; one should also check for low NBP or low urine output.
5.2.4 Demographics
In Figure 5.6 a), one can see that, in terms of demographics, age is not an indication of risk. Note, however, that this does not mean that the success of the treatment is independent of age, which is a different matter.
5.2.5 Clusters by pathology
One can see that, as expected, there is an increase in the incidence of SIRS in the clusters with more vasopressor patients, given that vasopressor administration is a common treatment in this condition. Less expected are the results of Figures 5.7 b) and c), where the incidence of both pneumonia and pancreatitis is roughly the same among clusters, even though these conditions are known to be able to evolve to SIRS, or to sepsis in the case of pneumonia.
Figure 5.5: Histograms of the most frequently sampled features per cluster: (a) heart rate, (b) SpO2%, (c) respiratory rate (breaths/min), (d) non-invasive blood pressure, (e) urine output.
Figure 5.6: Patient distribution per cluster by demographics: (a) age; (b) sex (1 - male, 0 - female).
Figure 5.7: Patient distribution per cluster by pathology: (a) SIRS, (b) pneumonia, (c) pancreatitis.
Chapter 6
Results
Two approaches were taken. First, a one-fits-all approach, in which a single model was built for all data points; then, a combination of models built from the clusters obtained in the previous chapter (Chapter 5). This multi-model approach was compared with the single model, and the results are discussed below.
First the general single-model results are presented, followed by the multi-model results; finally, the best results are compared.
6.1 Single-model Results
For the one-fits-all approach, the following results were obtained after 20 runs, using 60% of the data for training and 40% for testing (Table 6.1). This number of runs was chosen to balance computational time and model consistency.
             five features   24 features    p-values
AUC          0.71 ± 0.01     0.80 ± 0.01    < 0.05
specificity  0.71 ± 0.02     0.81 ± 0.03    < 0.05
sensitivity  0.72 ± 0.03     0.80 ± 0.03    < 0.05
accuracy     0.72 ± 0.03     0.80 ± 0.02    < 0.05
Table 6.1: Single-model results.
Table 6.1 shows that the model built upon only five features has significantly lower performance (p < 0.05) than the one using the whole 24-feature set. The five features were initially chosen as those most readily available in the ICU, as they had a higher sample rate. No feature selection step was made, so the lower performance may mean that information was lost when some important variables were left out. In [16], a feature selection step was performed on a similar dataset, and the variables white blood cells (WBC) and sodium were found to play an important role in predicting the use of vasopressors.
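For reference, a sketch of this evaluation protocol is shown below, with a logistic regression standing in for the Takagi-Sugeno fuzzy models actually used; the metric formulas follow the standard confusion-matrix definitions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression   # stand-in classifier
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    def evaluate(X, y, runs=20, threshold=0.5):
        """Repeated 60/40 splits; mean and std of AUC, specificity,
        sensitivity and accuracy over all runs."""
        scores = []
        for s in range(runs):
            Xtr, Xte, ytr, yte = train_test_split(
                X, y, train_size=0.6, random_state=s, stratify=y)
            model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
            p = model.predict_proba(Xte)[:, 1]
            tn, fp, fn, tp = confusion_matrix(yte, p >= threshold).ravel()
            scores.append([roc_auc_score(yte, p),
                           tn / (tn + fp),          # specificity
                           tp / (tp + fn),          # sensitivity
                           (tp + tn) / len(yte)])   # accuracy
        return np.mean(scores, axis=0), np.std(scores, axis=0)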
6.2 Multi-model Results
Table 6.2 shows the results of both approaches taken: an a priori decision based on the cluster centers, and an a posteriori decision based on the model outputs. Results show the mean and standard deviation over 20 runs, using 60% of the data for training and 40% for testing. The a posteriori approach returned significantly better classification performance, shown by the significantly better accuracy as well as significantly better specificity and sensitivity (p < 0.05). Thus it returned, on one hand, a better proportion of positives correctly classified as such, and on the other, a lower false alarm rate, as more negatives were also correctly classified as such.
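A sketch of the two decision criteria follows, under one plausible reading of the a posteriori rule (keep the per-cluster model whose output lies farthest from the classification threshold, i.e., the least uncertain response). Here models is a hypothetical list of per-cluster classifiers, centers are the cluster centers, and the 0.5 threshold is illustrative.

    import numpy as np

    def a_priori_predict(x, centers, models, threshold=0.5):
        # pick the model of the cluster whose center is nearest to the patient
        k = np.argmin(np.linalg.norm(centers - x, axis=1))
        return models[k].predict_proba(x[None, :])[0, 1] >= threshold

    def a_posteriori_predict(x, models, threshold=0.5):
        # run every model and keep the least uncertain output, i.e. the one
        # farthest from the decision threshold
        outs = np.array([m.predict_proba(x[None, :])[0, 1] for m in models])
        k = np.argmax(np.abs(outs - threshold))
        return outs[k] >= threshold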
             a priori        a posteriori   p-values
AUC          0.81 ± 0.01     0.85 ± 0.00    < 0.05
specificity  0.79 ± 0.06     0.85 ± 0.01    < 0.05
sensitivity  0.82 ± 0.06     0.84 ± 0.01    < 0.05
accuracy     0.80 ± 0.03     0.85 ± 0.01    < 0.05
Table 6.2: Multi-model results after optimization.
In addition, while using the reduced feature set to build a general model returned worse results, clustering on the same information led to a multi-model scheme with comparatively good results.
Lower results were expected in the case of the a priori model, as its models were based on the clustering of a reduced dataset, suggesting a possible loss of information that would lower the prediction rate of the models used. The overall result was nonetheless significantly better than that of the single generic model obtained from the five-feature dataset in Table 6.1.
6.3 Best results comparison
Table 6.3 compiles the single-model and the best multi-model results. The best multi-model scheme returned significantly better performance (p < 0.05) than the single-model case.
             single-model    multi-model    p-values
AUC          0.80 ± 0.01     0.85 ± 0.00    < 0.05
specificity  0.81 ± 0.03     0.85 ± 0.01    < 0.05
sensitivity  0.80 ± 0.03     0.84 ± 0.01    < 0.05
accuracy     0.80 ± 0.02     0.85 ± 0.01    < 0.05
Table 6.3: Single-model and multi-model results.
Chapter 7
Conclusions
This work proposed two decision approaches for a multi-model layout based on fuzzy clustering. The two decision criteria proposed are: 1) an a priori decision based on the distance from the cluster centers to the patient characteristics, and 2) an a posteriori decision in which each model is used and the final outcome is based on the uncertainty of each model's output response to the threshold.

Fuzzy clustering proved to be a suitable tool for finding patterns in the patient data, as a relationship between the obtained clusters and the output variable (administration of vasopressors) was found. The information gathered was insightful in light of the medical knowledge, and conclusions could be drawn from the analysis of the patient characteristics in each identified cluster. This information was used to obtain patient-specific models.

Significantly better results were obtained using a multi-model scheme than a general one to predict whether a patient needs vasopressors. This suggests that patient-specific models are more appropriate than a single general model.

In terms of the decision criteria used (a priori versus a posteriori), the results were significantly better for the a posteriori decision rule. Nonetheless, more tests with additional datasets are recommended before establishing one approach as better than the other.

As future work, the multi-model approach can be applied to other applications beyond clinical data, and the variables used could be selected through a feature selection process to avoid possible information loss and to improve performance.

Regarding further extensions to this work, the new fluids vasopressor dataset could also be used and compared, and another clustering attempt with more clusters could be performed in the hope of finding a cluster with a more significant proportion of patients on vasopressors.
Appendix A
Data Partitioning - GK Results
In the course of the data partitioning study, the Gustafson-Kessel (GK) clustering algorithm was also used. After it yielded no better results than the FCM algorithm, it was dropped in favor of the latter. The results are nonetheless presented below for reference.
As in the FCM case, all results were obtained from a batch of 20 runs, taking the mean values, for numbers of clusters ranging from 2 to 20.
A.1 All data, all features
For all the data points and features, the same group of validity measures (partition index, separation index, and Xie-Beni index) was calculated for GK clustering, as follows (Figure A.1).
Figure A.1: Validation measures for GK clustering of the full dataset: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
A.2 All features, last point
A second case was studied, using only the last time sample of each patient. The results are shown in Figure A.2.
Figure A.2: Validation measures using GK clustering for the last three records of each patient and all features: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
A.3 Five features
Later, the clustering process was applied to the complete data (including all patient samples) but only for the five most frequently sampled features, as done in the FCM case, with the following results in terms of validity measures per number of clusters (Figure A.3).
Figure A.3: Validation measures for GK clustering of the last three records of each patient and the five-feature subset: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
A.4 Last records, five features
Finally, a combination of the last two setups was tried: the last time samples of each patient and only the five most frequently sampled features. Figure A.4 shows the results of the validity measures for this test. In this case, only numbers of clusters ranging from 2 to 10 were tested, instead of up to 20 as in the other sections.
Figure A.4: Validation measures for GK clustering of the five-feature subset and the mean of the last three records: (a) partition index (Sc), (b) separation index (S), (c) Xie-Beni index (XB).
Appendix B
Cluster Histograms
The mean of the last three values of each feature of each patient was obtained, and histograms describing the feature distribution by cluster were drawn.
B.1 Physiological variables
First, the mean values of each feature per cluster were gathered in Table B.1, followed by the histograms for all 24 variables (Figures B.1 to B.24).
#    cluster 1  cluster 2  cluster 3  cluster 4
1    0.44       0.72       0.30       0.58
2    0.75       0.73       0.76       0.70
3    0.46       0.41       0.47       0.66
4    0.60       0.63       0.54       0.37
5    0.60       0.55       0.51       0.48
6    0.33       0.32       0.39       0.40
7    0.34       0.31       0.37       0.35
8    0.53       0.61       0.48       0.53
9    0.59       0.53       0.52       0.53
10   0.50       0.55       0.44       0.48
11   0.54       0.59       0.55       0.45
12   0.54       0.50       0.52       0.52
13   0.52       0.50       0.50       0.50
14   0.53       0.47       0.45       0.46
15   0.50       0.52       0.52       0.55
16   0.50       0.49       0.50       0.49
17   0.51       0.50       0.51       0.51
18   0.52       0.47       0.49       0.49
19   0.67       0.49       0.43       0.37
20   0.62       0.50       0.42       0.39
21   0.59       0.54       0.53       0.46
22   0.57       0.51       0.48       0.43
23   0.27       0.33       0.35       0.39
24   0.49       0.30       0.18       0.19

Table B.1: Mean of physiological variables by cluster.
(Figures B.1 to B.24 each show four histograms, one per cluster; x-axis: normalized feature value, y-axis: frequency.)

Figure B.1: Heart rate distribution per cluster.
Figure B.2: Temperature distribution per cluster.
Figure B.3: SpO2 distribution per cluster.
Figure B.4: Respiratory rate distribution per cluster.
Figure B.5: GCS total distribution per cluster.
Figure B.6: Braden score distribution per cluster.
Figure B.7: Hematocrit distribution per cluster.
Figure B.8: Platelets distribution per cluster.
Figure B.9: White blood cells (WBC) distribution per cluster.
Figure B.10: Hemoglobin distribution per cluster.
Figure B.11: Red blood cells (RBC) distribution per cluster.
Figure B.12: Blood urea nitrogen (BUN) distribution per cluster.
Figure B.13: Creatinine distribution per cluster.
Figure B.14: Glucose distribution per cluster.
Figure B.15: Potassium distribution per cluster.
Figure B.16: Chloride distribution per cluster.
Figure B.17: Sodium distribution per cluster.
Figure B.18: Magnesium distribution per cluster.
Figure B.19: Non-invasive blood pressure (NBP) distribution per cluster.
Figure B.20: NBP mean distribution per cluster.
Figure B.21: Arterial pH distribution per cluster.
Figure B.22: Arterial base excess distribution per cluster.
Figure B.23: Lactic acid distribution per cluster.
Figure B.24: Urine output distribution per cluster.
B.2 Variables at ICU entrance/exit
The same was then applied to the variables obtained at ICU entrance (age, sex, SOFA score) and at ICU exit, such as mortality (Figures B.25 to B.28).
Figure B.25: Age distribution per cluster.
Figure B.26: Sex distribution per cluster (1 - male, 0 - female).
Figure B.27: Mortality distribution per cluster (1 if the patient died during the ICU stay).
Figure B.28: SOFA score distribution per cluster.
B.3 Output variable
Also shown is the distribution of patients with and without vasopressors administered, the output variable of our study (Figure B.29).
Figure B.29: Vasopressor administration distribution per cluster (1 if vasopressors were administered).
B.4 Cluster distribution by pathology
Finally, the histograms pertaining to some of the clinical conditions prone to the development of distributive shock: pneumonia, pancreatitis, and systemic inflammatory response syndrome (SIRS) (Figures B.30 to B.32).
Figure B.30: Pneumonia patient distribution per cluster (1 - with pneumonia, 0 - without).
Figure B.31: Pancreatitis patient distribution per cluster (1 - with, 0 - without).
Figure B.32: SIRS patient distribution per cluster (1 - with, 0 - without).
Appendix C
Projections
Figure C.1 shows the projection of each variable of the five-feature reduced set, plotted pairwise, with the cluster centers represented by black circles. The negatives (patients without vasopressors) are shown in dark grey, while lighter grey represents the positives (patients with vasopressors administered).
Figure C.1: Pairwise projections of the five-feature set, with cluster centers marked.
References
[1] C. A. Graham and T. R. J. Parke. Critical care in the emergency department: shock and circulatory
support. Emergency Medicine Journal, 22(1):17–21, 2005.
[2] Stefan Herget-Rosenthal, Fuat Saner, and Lakhmir S Chawla. Approach to hemodynamic shock
and vasopressors. Clinical Journal of the American Society of Nephrology, 3(2):546–53, 2008.
[3] R. C. Bone, R. A. Balk, R. A. Cerra, R. P. Dellinger, A. M. Fein, W. A. Knaus, R. M. Schein, and W. J. Sibbald. Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. Chest, 101(6):1644–1655, 1992.
[4] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. Adaptive Computation and Machine Learning series. MIT Press, 2001.
[5] H. C. Koh and G. Tan. Data mining applications in healthcare. Journal of healthcare information
management, 19(2):64–72, 2005.
[6] Illhoi Yoo, Patricia Alafaireet, Miroslav Marinov, Keila Pena-Hernandez, Rajitha Gopidi, Jia-Fu
Chang, and Lei Hua. Data mining in healthcare and biomedicine: a survey of the literature. J
Med Syst, 36(4):2431–48, 2012.
[7] Asaad Y. Shamseldin, Kieran M. O’Connor, and G.C. Liang. Methods for combining the outputs of
different rainfall–runoff models. Journal of Hydrology, 197(1–4):203 – 229, 1997.
[8] Lihua Xiong, Asaad Y. Shamseldin, and Kieran M. O'Connor. A non-linear combination of the forecasts of rainfall-runoff models by the first-order Takagi-Sugeno fuzzy system. Journal of Hydrology, 245(1–4):196–217, 2001.
[9] A. P. Weigel, M. A. Liniger, and C. Appenzeller. Can multi-model combination really enhance the
prediction skill of probabilistic ensemble forecasts? Quarterly Journal of the Royal Meteorological
Society, 134(630):241–260, 2008.
[10] L. F. Mendonca, J. M. C. Sousa, and J. M. G. Sa da Costa. An architecture for fault detection and
isolation based on fuzzy methods. Expert Systems with Applications, 36(2):1092–1104, 2009.
[11] R. Polikar. Ensemble learning. Scholarpedia, 4(1):2776, 2009.
[12] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[13] Moacir P Ponti Jr. Combining classifiers: From the creation of ensembles to the decision fusion. In
2011 24th SIBGRAPI Conference on Graphics, Patterns, and Images Tutorials, pages 1–10. IEEE,
2011.
[14] L.I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.
[15] Pinaki Chowdhury, Sukhendu Das, Suranjana Samanta, and Utthara Mangai. A survey of decision fusion and feature fusion strategies for pattern classification. IETE Technical Review, 27(4):293–307, 2010.
[16] A. S. Fialho, F. Cismondi, S. M. Vieira, J. M. C. Sousa, S. R. Reti, L. A. Celi, M. D. Howell, and
S. N. Finkelstein. Fuzzy modeling to predict administration of vasopressors in intensive care unit
patients. In 2011 IEEE International Conference on Fuzzy Systems (FUZZ), pages 2296–2303,
june 2011.
[17] Federico Cismondi, Abigail L. Horn, Andre S. Fialho, Susana M. Vieira, Shane R. Reti, Joao
M. C. Sousa, and Stan Finkelstein. Multi-stage modeling using fuzzy multi-criteria feature selec-
tion to improve survival prediction of ICU septic shock patients. Expert Systems with Applications,
39(16):12332–12339, 2012.
[18] Rudolf Kruse, Christian Doring, and Marie-Jeanne Lesot. Fundamentals of fuzzy clustering, 2007.
[19] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering algorithms and validity measures. In Pro-
ceedings of the 13th International Conference on Scientific and Statistical Database Management,
SSDBM ’01, page 3, Washington, DC, USA, 2001. IEEE Computer Society.
[20] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–298, 1967.
[21] L. A. Zadeh. Fuzzy sets. Information and Control, 8(3):338–353, 1965.
[22] J. C. Dunn. A fuzzy relative of the isodata process and its use in detecting compact well-separated
clusters. Journal of Cybernetics, 3(3):32–57, 1973.
[23] James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic
Publishers, Norwell, MA, USA, 1981.
[24] J. M. C. Sousa and U. Kaymak. Fuzzy decision making in modeling and control. World Scientific
series in robotics and intelligent systems. World Scientific, 2003.
[25] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. Understanding of internal
clustering validation measures. In Proceedings of the 2010 IEEE International Conference on Data
Mining, ICDM ’10, pages 911–916, Washington, DC, USA, 2010. IEEE Computer Society.
[26] L. Kaufman and P.J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Wiley
series in probability and mathematical statistics: Applied probability and statistics. Wiley, 1990.
[27] David L. Davies and Donald W. Bouldin. A cluster separation measure. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, PAMI-1(2):224–227, 1979.
[28] A. M. Bensaid, L. O. Hall, J. C. Bezdek, L. P. Clarke, M. L. Silbiger, J. A. Arrington, and R. F. Murtagh.
Validity-guided (re)clustering with applications to image segmentation. IEEE Transactions on Fuzzy
Systems, 4(2):112–123, may 1996.
[29] X. L. Xie and G. Beni. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 13(8):841–847, aug 1991.
[30] Y. Fukuyama and M. Sugeno. A new method of choosing the number of clusters for fuzzy c-means
method. 1989.
[31] I. Gath and A.B. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(7):773–780, 1989.
[32] Usama Fayyad, Gregory Piatetsky-shapiro, and Padhraic Smyth. From data mining to knowledge
discovery in databases. AI Magazine, 17:37–54, 1996.
[33] Susana Vieira. Soft Computing Techniques Applied to Feature Selection. PhD thesis, Universidade Tecnica de Lisboa, 2010.
[34] M. Sugeno and T. Yasukawa. A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems, 1(1):7–31, Feb. 1993.
[35] Susana M. Vieira, Joao M. C. Sousa, and Uzay Kaymak. Fuzzy criteria for feature selection. Fuzzy
Sets and Systems, 189(1):1–18, 2012.
[36] T. Takagi and M. Sugeno. Fuzzy identification of systems and its application to modeling and
control. IEEE Trans. System Man and Cybernetics, (15):116–132, 1985.
[37] J. A. Swets. Measuring the accuracy of diagnostic systems. Science, 240(4857):1285–1293, 1988.
[38] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
[39] Mohammed Saeed, Mauricio Villarroel, Andrew T. Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H. Kyaw, Benjamin Moody, and Roger G. Mark. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): A public-access intensive care unit database. Critical Care Medicine, 39:952–960, May 2011.
[40] G. D. Clifford, W. J. Long, G. B. Moody, and P. Szolovits. Robust parameter extraction for decision
support using multimodal intensive care data. Philosophical Transactions of the Royal Society A:
Mathematical, Physical and Engineering Sciences, 367(1887):411–429, 2009.
[41] P. D. Allison. Missing Data. Number 136 in Quantitative Applications in the Social Sciences. SAGE
Publications, 2001.
[42] A. S. Fialho, F. Cismondi, S. M. Vieira, J. M. C. Sousa, S. R. Reti, R. Welsch, M. D. Howell, and S. N. Finkelstein. Missing data in large intensive care units databases. Critical Care Medicine, 38(12):U6–U6, 2010.
[43] D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Understanding robust and exploratory data analysis.
Wiley Classics Library Editions. Wiley, 2000.
[44] F. Cismondi, A. S. Fialho, S. M. Vieira, J. M. C. Sousa, S. R. Reti, M. D. Howell, and S. N. Finkel-
stein. Computational intelligence methods for processing misaligned, unevenly sampled time series
containing missing data. In 2011 IEEE Symposium on Computational Intelligence and Data Mining
(CIDM), pages 224–231, april 2011.
[45] N. R. Pal and J. C. Bezdek. On cluster validity for the fuzzy c-means model. IEEE Transactions on
Fuzzy Systems, 3(3):370–379, aug 1995.