A data mining approach to predict conversion from mild cognitive impairment to Alzheimer’s Disease
Luís Jorge Matias de Lemos
Dissertation submitted to obtain the Master Degree in Information Systems and Computer Engineering
Jury
President: Doctor José Carlos Alves Pereira Monteiro
Supervisor: Doctor Sara Alexandra Cordeiro Madeira
Co-supervisor: Doctor Pedro Filipe Zeferino Tomás
Members: Doctor Cláudia Martins Antunes
Doctor Alexandre Valério de Mendonça
November 2012
Acknowledgments
First, I would like to thank my family for their support.
Secondly, I would like to thank my advisors, Sara Madeira and Pedro Tomás, for playing a
fundamental and difficult role in this work by guiding and motivating me. I would also like to thank
all the members of the NEUROCLINOMICS team for the hours they spent listening to my
presentations and providing valuable feedback.
My thanks to all my colleagues from room 128/425, and to André Silva for helping me with the
English.
This work was partially supported by FCT - Fundação para a Ciência e a Tecnologia under
project PTDC/EIA-EIA/111239/2009 (NEUROCLINOMICS - Understanding NEUROdegenerative
diseases through CLINical and OMICS data integration).
Finally, a special thanks to the author of the template used in this thesis, Pedro Tomás.
Abstract
Alzheimer’s disease (AD) is a well known neurodegenerative disease causing cognitive impairment.
Despite being one of the best studied diseases of the central nervous system, it remains
incurable. Mild Cognitive Impairment (MCI) is currently considered to be an early stage of a
neurodegenerative disease. Patients diagnosed with MCI are assumed to have a higher risk of
evolving to AD. In this context, the correct diagnosis of MCI and an effective assessment of its
predictive value for the conversion to AD are crucial.
In this thesis, neuropsychological data is used to distinguish patients with MCI from those
already suffering from AD and to predict the evolution of MCI patients to AD in time windows of 2,
3 and 4 years. We analyse a dataset with patients labelled by clinicians as MCI or AD. As is
the case for most real clinical data, this dataset is strongly imbalanced and has a high percentage
of missing values.
We use state-of-the-art supervised learning techniques and perform a thorough study of the
effect of class imbalance and missing values on their performance. Since the number of attributes
is large, feature selection is studied and used to effectively decrease the dimensionality of the
problem. A data mining methodology was created to automate the oversampling and
parameter search. The resulting tool automatically creates the models and parametrizes them,
taking into account how balanced the data is. The obtained results indicate that the developed
models achieve an accuracy of 91% in patient diagnosis and up to 82% in patient prognosis.
To help healthcare professionals, a decision support system was created that includes the
diagnosis and prognosis models developed in this work.
Keywords
Alzheimer’s Disease, Data Mining, Temporal Windows, Diagnosis, Prognosis, Prediction.
Resumo
A doença de Alzheimer (DA) é uma conhecida doença neurodegenerativa que causa uma
deficiência cognitiva. Apesar de ser uma das doenças mais estudadas do sistema nervoso central,
continua sem cura. A deficiência cognitiva ligeira (DCL) é considerada um estado inicial de
uma doença neurodegenerativa. Assume-se que pacientes diagnosticados com DCL têm um
maior risco de evolução para DA. Neste contexto, um correto diagnóstico e uma análise eficaz da
probabilidade de conversão são cruciais.
Nesta tese, dados neuropsicológicos são utilizados para distinguir os pacientes com DCL
daqueles com DA e para prever a conversão dos doentes com DCL para DA, em janelas temporais
de 2, 3 e 4 anos. O conjunto de dados analisado foi classificado por médicos como DCL e DA.
Como na maioria dos conjuntos de dados clínicos reais, estes são extremamente desbalanceados e
com um elevado número de valores omissos.
Foram utilizadas técnicas de aprendizagem supervisionada do estado da arte e foi também
realizado um estudo sobre o efeito, na classificação, do desbalanceamento de classes e dos
valores omissos. Como o número de atributos é grande, técnicas de seleção de atributos foram
utilizadas para diminuir a dimensionalidade do problema. Uma metodologia foi criada para
automatizar o processo de sobreamostragem e de procura de parâmetros. A ferramenta criada
cria automaticamente os modelos de classificação e parametriza-os, tendo em conta o estado de
balanceamento dos dados. Os resultados obtidos indicam que os modelos desenvolvidos obtêm
uma precisão, no diagnóstico de pacientes, na ordem dos 91% e, no prognóstico, na ordem dos
82%.
Para auxiliar os profissionais de saúde, um sistema de apoio à decisão foi criado e colocado à
disposição destes, com os modelos de diagnóstico e prognóstico criados no decorrer desta tese.
Palavras Chave
Doença de Alzheimer, Mineração de Dados, Janelas Temporais, Diagnóstico, Prognóstico,
Predição.
Contents
1 Introduction 1
1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 5
2.1 Alzheimer’s Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Neuropsychological tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Correlated based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Minimum redundancy maximum relevance (mRMR) . . . . . . . . . . . . . 18
2.4 Overcoming class imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Techniques to deal with imbalanced datasets . . . . . . . . . . . . . . . . . 20
2.4.1.A Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1.B Synthetic Minority Over-sampling Technique (SMOTE) . . . . . . 20
2.4.1.C Cost sensitive classifiers . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Differentiating MCI from AD (Diagnosis) 29
3.1 Formulation and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.2 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Predicting conversion from MCI to AD (Prognosis) 45
4.1 Prognosis prediction approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 First and Last Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.2 Temporal Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5 Decision Support System 61
5.1 System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6 Conclusions and Future Work 67
A Appendix Medical exams (in Portuguese) 75
B Appendix Diagnosis 81
C Appendix Prognosis 83
C.1 First and Last Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
C.2 Temporal window: Two years . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
C.3 Temporal window: Three years . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
C.4 Temporal window: Four years . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
List of Figures
2.1 Decision tree using 2 attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Neurode representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Multilayer feed-forward neural network. . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 SVM linear separable data representation . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Overfitting (Error versus Model Complexity) . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Histogram of missing values per feature in the original dataset. . . . . . . . . . . . 31
3.2 Classification model for the training data. . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Model that simulates real-world usage. . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Missing value replacement under the random assumption. . . . . . . . . . . . . 36
3.5 Missing value replacement under the non-random assumption. . . . . . . . . . 37
3.6 Train results of Diagnosis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7 Results for the diagnosis using an independent test set . . . . . . . . . . . . . . . 42
4.1 Representation of the prognosis class labels . . . . . . . . . . . . . . . . . . . . . 46
4.2 Variation of the prognosis class labels . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Classification model for the training data. . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Model that simulates real-world usage. . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Test results of Prognosis using First And Last Evaluations . . . . . . . . . . . . . . 50
4.6 Test results of Prognosis using two years temporal window . . . . . . . . . . . . . 52
4.7 Test results of Prognosis using three years temporal window . . . . . . . . . . . . 55
4.8 Test results of Prognosis using four years temporal window . . . . . . . . . . . . . 58
5.1 DSS web service architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Prototype data input screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3 Prototype output screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 DSS system screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
C.1 Train results of Prognosis using First and Last Evaluations . . . . . . . . . . . . . . 83
C.2 Train results of Prognosis using two years temporal window . . . . . . . . . . . . . 85
C.3 Train results of Prognosis using three years temporal window . . . . . . . . . . . . 86
C.4 Train results of Prognosis using four years temporal window . . . . . . . . . . . . 87
List of Tables
2.1 Confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Synthesis of the related work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Original Dataset details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Dataset details after pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Grid search parameter intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Details of the train set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Details of the test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 Diagnosis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Best Classifiers for the Diagnosis problem. . . . . . . . . . . . . . . . . . . . . . . 42
3.8 Confusion matrix for the diagnosis problem. . . . . . . . . . . . . . . . . . . . . . . 43
4.1 FirstLastDemographic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 First and Last Evaluation’s Parameters. . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Best Classifiers for the Prognosis problem First and Last approach. . . . . . . . . . 49
4.4 Two Years Temporal Window Demographic . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Classification model parameters for the prognosis in a temporal window of 2 years. 51
4.6 Best Classifiers for the prognosis problem with a 2 years temporal window. . . . . 53
4.7 Three Years Temporal Window Demographic . . . . . . . . . . . . . . . . . . . . . 53
4.8 Classification model parameters for the prognosis in a temporal window of 3 years . 54
4.9 Best Classifiers for the prognosis problem with a 3 years temporal window. . . . . 55
4.10 Four Years Temporal Window Demographic . . . . . . . . . . . . . . . . . . . . . . 56
4.11 Classification model parameters for the prognosis in a temporal window of 4 years. 56
4.12 Best Classifiers for the prognosis problem with a 4 years temporal window. . . . . 57
4.13 Best Models to each progression approach . . . . . . . . . . . . . . . . . . . . . . 59
A.1 Feature List part-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A.2 Feature List part-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.3 Feature List part-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
A.4 Feature List part-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.5 Feature List part-5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
B.1 Selected Features for diagnosis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
C.1 Selected Features for Prognosis using first and last evaluations. . . . . . . . . . . 84
C.2 Selected Features for Prognosis using the two years temporal window. . . . . . . . 85
C.3 Selected Features for Prognosis using the three years temporal window. . . . . . . 86
C.4 Selected Features for Prognosis using the four years temporal window. . . . . . . . 87
1 Introduction
Declines in cognitive and motor functions, together with other evidence of neurological
degeneration, become increasingly likely as healthy people age. The fact is that everyone will
experience altered brain function, although some at an earlier age or at a faster rate than others.
As such, distinguishing the motor and cognitive declines of normal ageing from those due to
pathological processes, and understanding individualized disease diagnostic and prognostic
patterns, are ongoing research challenges [42]. In this context, Alzheimer’s disease (AD), a well
known neurodegenerative disease causing cognitive impairment, is amongst the best studied
diseases of the central nervous system due to its devastating effect on patients and their families,
and to its socio-economic impact on modern societies. Nevertheless, it remains incurable.
Every year millions of new Alzheimer’s Disease cases are diagnosed. The result is dementia
in elderly individuals; institutionalization and expensive medical care are a common outcome. An early
diagnosis and prognosis can improve the patient’s quality of life, minimizing the need for institutionalization
and expensive medical care, reducing the patient’s and the family’s suffering and minimizing the
socio-economic effects on society. In this context, finding out if and when a patient will
progress from Mild Cognitive Impairment (MCI) to Alzheimer’s Disease is of major importance
for the timely administration of pharmaceutical and therapeutic interventions. Furthermore, it allows
medical doctors to adjust the periodicity of medical consultations.
Mild Cognitive Impairment is currently considered to be an early stage of a neurodegenerative
disease, particularly AD. Patients diagnosed with MCI are regarded with special attention since
they are assumed to have a higher risk of evolving to dementia, usually AD [51]. Under these
assumptions, the correct diagnosis of MCI conditions and an effective assessment of its predictive
value for the conversion to AD are thus of major importance. However, the definition of MCI and
its diagnosis criteria are not yet consensual; the pathological and molecular substrate of people
diagnosed with MCI is not well established [40]. Moreover, people considered to be suffering
from preMCI, that is, people having cognitive complaints but not fulfilling the criteria for MCI, have
recently been shown to have a high risk of progression to MCI and AD [33]. This makes the
diagnosis of MCI a difficult task in itself and consequently makes the prediction of MCI-to-AD
conversion an even more complicated task.
Neuropsychological tests have been used by medical doctors mainly because they are cheaper
and faster than PET scans and biomarker searches. Furthermore, technologies such as PET scans
and biomarkers are not globally available. The neuropsychological tests involve simple tasks,
such as those concerning orientation, memory, attention and language, to evaluate the mental
state of the patient. This work aims to use such data to predict the conversion of Mild Cognitive
Impairment (MCI) to Alzheimer’s Disease (AD). The use of data mining algorithms allows the
extraction of knowledge, or rules, from the data regarding the prediction of MCI-to-AD conversion.
1.1 Problem Formulation
To study the relation between MCI and AD, researchers typically focus on three related but
distinct problems [7, 18, 25, 27, 36, 50]: (1) distinguishing MCI from AD; (2) predicting the conversion
from MCI to AD; and (3) predicting the time to conversion from MCI to AD. In this work, we tackle
the problem of distinguishing patients with MCI from those already suffering from AD, using
neuropsychological data. This type of data has also been used by other authors [7, 18, 25, 36]. Given
the increasing difficulty of these three problems and the non-consensual classification of patients
as MCI, distinguishing MCI from AD is an important problem in itself, but it gains increasing
relevance as support for the feasibility of effectively tackling the conversion and time-to-progression
problems.
This work tackles two related problems: (i) distinguishing MCI from AD patients; and (ii)
predicting whether an MCI patient will evolve to AD. To achieve this goal, we use state-of-the-art
machine learning techniques, such as SVMs, artificial neural networks, Naïve Bayes, C4.5 decision trees
and k-nearest neighbours. Furthermore, we investigate the application of techniques that reduce
the influence of missing values in the learning process and that reduce data complexity, therefore
increasing model generalization.
1.2 Contributions
The outcome of this work is the following:
1. Development of a framework for model optimization that includes techniques for: (i) dimen-
sionality reduction, through feature selection; (ii) reducing the effect of class imbalance,
through synthetic minority oversampling; and (iii) dealing with missing values.
2. Application of the model optimization framework to the diagnosis problem (overall accuracy
of up to 91%) and to the prognosis problem (overall accuracy of up to 82% for a 3-year
temporal window).
3. Proposal of a temporal-window methodology for the prognosis of MCI patients, with results
showing that it outperforms the traditional approach;
4. Development of a decision support system, based on a web services infrastructure, for the
deployment of the developed models in healthcare professionals’ offices.
Part of the results obtained in the course of this thesis have already been communicated in:
[13] Lemos et al., ”Discriminating Alzheimer’s disease from mild cognitive impairment using
neuropsychological data”, ACM SIGKDD Workshop on Health Informatics (HI-KDD2012), Beijing,
China, August 2012;
[14] Lemos et al., ”Predicting conversion from mild cognitive impairment to Alzheimer’s
disease using neuropsychological data: Preliminary results”, 26th Meeting of the Grupo de Estudo de
Envelhecimento Cerebral e Demências, Tomar, June 2012;
Furthermore, a new paper is under preparation to communicate the results concerning
prognostic prediction of Alzheimer’s disease, namely the comparison of the temporal-windows
approach against the traditional approach.
1.3 Dissertation outline
The work described in this thesis is organized as follows. Chapter 2 describes the background
on Alzheimer’s disease and introduces the basic concepts on data mining algorithms and tech-
niques. Chapter 3 describes the work related to the diagnosis of patients, i.e., regarding the differ-
entiation between MCI and AD patients. For this, a framework for obtaining the model parameters
is described, a comparison between several missing value imputation techniques is performed
and the results for patient diagnosis are presented. Chapter 4 applies a similar strategy to
perform prognostic prediction of MCI patients. This considers two approaches: (i) predicting whether a
patient will ever convert to AD and (ii) predicting whether a patient will convert to AD after a fixed
period of time. Results presented in this chapter suggest that the latter approach leads to more
accurate models. All these models are then integrated into a decision support system, based on
a web services infrastructure, whose architecture is briefly described in Chapter 5. Finally, the
conclusions are drawn in Chapter 6, where some possibilities regarding future work are also presented.
2 Background
2.1 Alzheimer’s Disease
By the latest estimates, 25 million people currently suffer from dementia and, as a consequence
of population ageing, the number of persons affected by this condition is expected to climb,
doubling every 20 years. Complaints of a cognitive nature are very common in aged individuals and can
be the first sign of ongoing neurological disorders such as AD [37], which is the most common
irreversible, progressive cause of dementia. AD can be described as a gradual loss of memory
and cognitive skills. Every year over 5 million new cases are reported and the incidence
increases with the age of the individual: 1% at age 60, 6% at age 70 and 8% at age 85 or
older. These numbers are likely to rise as a consequence of the expected increase in life
expectancy [2]. The relation between age and AD incidence is evident, making age the most
influential risk factor in the diagnosis of AD. In fact, AD incidence rises from 2.8 per thousand
person-years for individuals between 65 and 69 years old to 56.1 per thousand person-years for
individuals older than 90. Nearly 10% of people older than 70 have significant
memory loss, and probably more than half of those have AD. For people over 85 years old, it is estimated
that 25% to 45% have dementia [2]. Among the people who have cognitive
complaints, it is possible to identify those at risk of progressing to dementia: those suffering from MCI.
Since the MCI classification mandates a cognitive decline greater than expected
for the person’s age and education level, neuropsychological testing is a fundamental element
of the diagnosis [37]. Currently, many efforts are being carried out to investigate AD pathology
and develop appropriate treatment strategies. These strategies have centered their attention on the
long-term conservation of cognitive and functional abilities or on slowing down the disease
development, along with reducing behavioural symptoms and maintaining the patient’s quality of life.
Nowadays, there is no treatment leading to a cure or a complete halt of AD progression.
Nonetheless, a current medical objective is the reduction of symptoms, which can delay the
institutionalization of the patient, thereby reducing caregiver costs [49]. It is possible to detect
traces, or biomarkers, of AD in patients with MCI through magnetic resonance imaging
volumetric studies, neurochemical analysis of the cerebrospinal fluid and Positron Emission
Tomography. These techniques have higher accuracy and sensitivity than neuropsychological tests
in predicting the progression of MCI patients to dementia; however, these methods are expensive,
technically challenging, in some cases invasive, and not widely accessible [37].
2.1.1 Neuropsychological tests
To diagnose, stage, assess and monitor AD, MCI and other dementias, the mental
health of a patient is assessed through neuropsychological assessment using a common set of
tests. These tests aim to identify and quantify cognitive, functional and behavioural symptoms.
A number of test batteries have been developed by medical doctors to assess the mental health of
patients. The most important batteries are: the Mini-Mental State Examination (MMSE), the Alzheimer’s
Disease Assessment Scale (ADAS) and, in our case, the Bateria de Lisboa para Avaliação de
Demência (Lisbon Battery for Dementia Evaluation) (BLAD). Each battery is composed
of multiple tests, and some batteries are composed of multiple other batteries.
The MMSE is one of the most widely used test batteries for a brief evaluation of cognitive
status in adults [38, 20, 52]. The ADAS was designed to measure the severity of the most important
symptoms of AD. The Alzheimer’s Disease Assessment Scale Cognitive subscale (ADAS-cog) is
the most popular cognitive testing instrument used in clinical trials of nootropics (drugs, functional
foods, supplements, etc., that improve mental functions). It consists of 11 tasks measuring the
disturbances of memory, language, praxis, attention and other cognitive abilities, which are often
referred to as the core symptoms of AD [31].
BLAD [6, 52] is a comprehensive neuropsychological battery evaluating multiple cognitive
domains and has been validated for the Portuguese population. This battery includes tests for
the following cognitive domains: attention (Cancellation Task); verbal, motor and graphomotor
initiatives (Verbal Semantic Fluency, Motor Initiative and Graphomotor Initiative); verbal compre-
hension (a modified version of the Token Test); verbal and non-verbal abstraction (Interpreta-
tion of Proverbs and the Raven Progressive Matrices); visual-constructional abilities (Cube Copy)
and executive functions (Clock Draw); calculation (Basic Written Calculation); immediate memory
(Digit Span forward); working memory (Digit Span backward); learning and verbal memory (Verbal
Paired-associate Learning, Logical Memory and Word Recall).
The data used in this work was obtained using BLAD. Each evaluation of a patient corresponds
to an instance identified by the evaluation date and the patient’s ID. The majority of
patients have multiple evaluations. Each evaluation is associated with a date and a
patient classification, which is one of the following: Normal, Pre-MCI, MCI-frontal, slight MCI,
MCI, amnesic MCI, MCI md, advanced MCI, MCI-a, vascular MCI, slight Dementia, Dementia and
without class. This data was initially pre-processed to contain only 4 classes: Normal, Pre-MCI,
MCI and Dementia. The new MCI class is composed of all MCI subtypes. The instances
without classification were removed, and the Normal and Pre-MCI instances were also discarded.
Many instances contain missing values for a set of neuropsychological tests.
2.2 Classification
To extract useful knowledge from this data, data mining techniques have to be used. Generally
these are divided into 7 steps [23]:
1. Data Cleaning, to remove inconsistent data and outliers. In our problem this is
of great importance and has already been addressed by reporting errors to the medical
doctors.
2. Data Integration, to combine data from different sources. In this work there was no
need to perform it, since the data was already delivered as a single table. This may, however,
be important in the future if data from another database is integrated (e.g., the ADNI database)1.
3. Data Selection (sometimes referred to as feature selection), to discover relevant features. This
step is of enormous importance to simplify the data and minimize the confusion presented
to the classifier.
4. Data Transformation, to transform the data so that it fits the classification process. For
now, this step is performed automatically by the WEKA software [22]. In the future, it will be
revisited if newly studied algorithms require it.
5. Data Mining, the process where intelligent methods are applied in order to extract data
patterns. For the diagnosis and prognosis problems, a comparative study of various
algorithms was performed.
6. Pattern Evaluation, with the purpose of recognizing interesting patterns that represent the
knowledge. This step corresponds to the evaluation of the different classifiers and the different
parameters used.
1 http://adni.loni.ucla.edu/, last accessed 13 October 2012
7. Knowledge Presentation, where visualization and knowledge representation techniques are
used to present the acquired knowledge to the user. In our case, this will be performed when
we show our results and rules to the medical doctors.
Classification can be described as a two-stage process [23]. In the first stage, a classifier
describing a set of data classes is built. This stage is designated the learning step or training
phase. In this stage the classification algorithm is ”learning from” a training set composed of
data instances, each made up of an n-dimensional attribute vector, X = (x1, ..., xn), and a
class label. In this case, X is a set of attributes extracted from the neuropsychological data, and the
class label is the patient’s mental health given by a medical evaluation and categorized as MCI or
AD. The attributes in vector X can be numerical or categorical. The instances used to train the
classification algorithm compose the training set. This type of process is known as supervised
learning, since the class label attribute is provided for each X, in contrast to unsupervised
learning algorithms, which do not know the class label attribute or the number of classes to be
learned in advance. In the context of classification, the n-dimensional attribute vector representing
an evaluation of the patient, together with the respective class label, is called an instance. In
the second stage, the model obtained is used to classify the test set. The test set is a subset of the data,
independent from the training set, that is used to measure the accuracy of the classification model.
It should be noted that in this work we only use supervised methods. However, as future work,
unsupervised learning could be used: for example, to decrease the complexity presented to the classifier,
clustering techniques could be applied to divide the MCI group into subgroups.
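The two-stage process above relies on a test set that is independent from the training set. A minimal holdout split can be sketched as follows; this is only an illustrative stdlib-Python sketch, and the function name and the 70/30 split are assumptions, not the experimental protocol used in this thesis.

```python
import random

def train_test_split(instances, test_fraction=0.3, seed=42):
    """Split labelled instances into independent train and test sets.

    Each instance is a (attribute_vector, class_label) pair; these names
    are illustrative, not taken from the thesis.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = instances[:]        # copy: do not mutate the caller's list
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labelled data in the (attributes, label) shape described above
data = [([i, i % 5], "MCI" if i < 12 else "AD") for i in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))  # 14 6
```

The model is fit only on `train`; accuracy reported on `test` then estimates performance on unseen patients.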
k-Nearest-Neighbour Classifiers
The k -Nearest-neighbour (kNN) classification consists on learning by comparison. Suppose
we define a metric to evaluate the distance between two instances, for example, or the Euclidean
distance:
dist(X1, X2) =
√√√√ n∑i=1
(x1i − x2i)2 (2.1)
or the the Manhattan distance:
dist(X1, X2) =
n∑i=0
|x1i − x2i| (2.2)
where X_1 = (x_{11}, x_{12}, ..., x_{1n}) and X_2 = (x_{21}, x_{22}, ..., x_{2n}). The kNN algorithm works as follows. For each instance X_i in the test set, find the K nearest instances in the training set (X_{i1}, X_{i2}, ..., X_{iK}). Then, classify the instance X_i as belonging to the most common class among the K nearest neighbours. Typically, the values of each attribute are normalized before using the Euclidean distance. This prevents the under-weighting of attributes with a smaller range relative to attributes with a larger range. To deal with missing values, the classifier assumes the highest possible difference between the two attribute values, which can cause classification mistakes. However, adequate pre-processing can overcome this limitation of the algorithm, for example by replacing missing values with the mean value of the attribute [12].
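A minimal sketch of this procedure (distance metric, K nearest neighbours, majority vote), assuming small in-memory lists; the training instances below are made up for illustration:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    # Equation (2.1): square root of the sum of squared attribute differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_predict(train, test_instance, k=3):
    # train: list of (attribute_vector, class_label) pairs
    neighbours = sorted(train, key=lambda inst: euclidean(inst[0], test_instance))[:k]
    # majority vote among the k nearest neighbours
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1.0, 1.0), "MCI"), ((1.2, 0.9), "MCI"),
         ((5.0, 5.0), "AD"), ((4.8, 5.2), "AD")]
print(knn_predict(train, (1.1, 1.0), k=3))  # → MCI
```

In practice the attribute values would be normalized first, as noted above, so that attributes with large ranges do not dominate the distance.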
Naïve Bayes
Bayesian classifiers are statistical methods that can forecast class membership probabilities using the Bayes theorem in (2.3):

P(H|X) = \frac{P(X|H) P(H)}{P(X)}    (2.3)

where H represents some hypothesis, such as belonging to a class C_i, and X is the instance. Studies comparing classification algorithms have found that a simple Bayesian classifier, such as Naïve Bayes, can in some cases be comparable in performance to decision tree and neural network classifiers [23]. Bayesian classifiers have also demonstrated high accuracy and speed when applied to large amounts of data. Naïve Bayes classifiers assume that attributes are independent and work as follows [23]:
1. Let D be a training set (n-dimensional attribute vectors and respective class labels).
2. Suppose that there are m classes, C_1, C_2, ..., C_m. Given a test instance, X, the classifier predicts the class X belongs to by choosing the class with the highest posterior probability, P(C_i|X):

\forall_{j \neq i} \; P(C_i|X) > P(C_j|X)    (2.4)

The class C_i maximizing P(C_i|X) is called the maximum a posteriori hypothesis.
3. Since P(X) is constant for all classes, we only need to maximize P(X|H)P(H). Replacing the hypothesis H by the class C_i, we have P(X|C_i)P(C_i). If the class prior probabilities are unknown, we assume that all classes are equally likely and maximize P(X|C_i).
4. Since Naïve Bayes classifiers assume that attributes are conditionally independent, it follows that

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \cdot P(x_2|C_i) \cdots P(x_n|C_i)    (2.5)

5. Estimation of P(x_k|C_i) is performed differently for categorical and continuous-valued attributes. For categorical attributes, P(x_k|C_i) is the number of instances of class C_i in the training set having the value x_k for attribute k, divided by the number of instances of class C_i in the training set. For continuous-valued attributes, the probability density function must be estimated. A simple approach is to assume that P(x_k|C_i) is normally distributed, in which case:

P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})    (2.6)

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (2.7)

where \mu is the expected value of x_k and \sigma is the standard deviation.
6. To predict the class label of the test instance, X, we use (2.5) for each class and then, by
applying (2.4) we obtain the most probable class.
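Steps 1 to 6 can be sketched for continuous-valued attributes under the Gaussian assumption of (2.6)-(2.7); the training data below is made up for illustration:

```python
import math
from collections import defaultdict

def gaussian(x, mu, sigma):
    # Equation (2.7): normal probability density
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit(train):
    # Step 5: estimate per-class prior, mean and standard deviation of each attribute
    by_class = defaultdict(list)
    for x, label in train:
        by_class[label].append(x)
    params = {}
    for label, rows in by_class.items():
        stats = []
        for k in range(len(rows[0])):
            vals = [r[k] for r in rows]
            mu = sum(vals) / len(vals)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
            stats.append((mu, sigma))
        params[label] = (len(rows) / len(train), stats)  # (prior, attribute stats)
    return params

def predict(params, x):
    # Steps 2-4 and 6: choose the class maximizing P(Ci) * prod P(xk|Ci)
    best, best_p = None, -1.0
    for label, (prior, stats) in params.items():
        p = prior
        for xk, (mu, sigma) in zip(x, stats):
            p *= gaussian(xk, mu, sigma)
        if p > best_p:
            best, best_p = label, p
    return best

train = [((1.0, 2.0), "MCI"), ((1.2, 2.2), "MCI"),
         ((5.0, 6.0), "AD"), ((5.4, 6.4), "AD")]
model = fit(train)
print(predict(model, (1.1, 2.1)))  # → MCI
```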
Decision Trees
A decision tree is a model structure where each non-leaf node has a test on an attribute, each
branch represents an outcome of the test and each leaf has a class label (see Figure 2.1). The
top node is the root node. In Figure 2.1 the root node is the node that tests Attribute 1.
Figure 2.1: A decision tree that uses 2 attributes: Attribute 1, which is ternary (Low, Normal and High), and Attribute 2, which is binary (True or False). Each internal (non-leaf) node (in blue) represents a test on an attribute. Each leaf node (in orange) represents a class (Class 1 or Class 2).
When this classifier receives an instance with an unknown class label, the attributes of the instance are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class label for that instance. In the case of Figure 2.1, an instance X = {Attribute 1 = Normal, Attribute 2 = False} would first be tested at the Attribute 1 node (the root node). Since Attribute 1 = Normal, it would then be tested at the Attribute 2 node, and since Attribute 2 = False the tree would predict that the instance belongs to Class 1. If Attribute 1 were Low or High, only that first test would be necessary, and Attribute 2 would not be used.
A basic algorithm for the construction of a decision tree receives as input a set of instances, an attribute list and an attribute selection method. The set of instances is initially the training set but, since the algorithm is recursive, this set changes during execution. The attribute list is the list of attributes that describe the instances. Finally, the attribute selection method specifies a heuristic procedure for selecting the attribute that best discriminates the instances according to the class. This procedure uses an attribute selection measure, such as information gain [48], the Gini index [48] or the gain ratio [3]. The tree begins as a single node, N, representing the training set. If all the instances have the same class label, the node becomes a leaf, is annotated with that class, and the algorithm ends. Otherwise, the attribute selection method is called to determine the splitting criterion, i.e., the best way to partition the instances into individual classes. The splitting criterion indicates which branches should be grown from N with respect to the outcomes of the chosen test. The node N is then labelled with the splitting criterion, which becomes the test at the node. A branch is grown from N for each outcome of the splitting criterion, and the instances are divided accordingly. The splitting attribute falls into one of two scenarios: discrete-valued or continuous-valued. If the splitting attribute is discrete-valued, the outcomes of the test at N correspond directly to the known values of the attribute, and a branch is created for each value. In this case the attribute is removed from the attribute list, since it will not be considered in any further split. If the splitting attribute is continuous-valued, N has two outcomes, Attribute ≤ split point and Attribute > split point. The split point is returned by the attribute selection method, and two branches are grown from N with these outcomes as labels. The algorithm is recursive, repeating the process for each subset of the training set created. The possible stop conditions of the algorithm are:
• All the instances in the training set belong to the same class.
• There are no more attributes to split. In this case N is converted to a leaf and labelled with
the most common class in the current training set.
• There are no instances for a given branch. In this case, a leaf is created with the majority
class in the current training set.
Finally, the decision tree is returned.
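The recursive construction just described can be sketched as follows, assuming categorical attributes only and entropy reduction (the information gain measure discussed below) as the attribute selection method; the tree is represented as nested dictionaries, with class labels at the leaves:

```python
import math
from collections import Counter

def entropy(labels):
    # -sum(pi * log2 pi) over the class proportions in a label list
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(instances, attributes):
    # instances: list of (dict attribute -> value, class label) pairs
    labels = [lab for _, lab in instances]
    if len(set(labels)) == 1:          # stop: all instances share one class
        return labels[0]
    if not attributes:                 # stop: no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    # select the attribute whose partition needs the least remaining information
    def remainder(attr):
        counts = Counter(x[attr] for x, _ in instances)
        return sum((n / len(instances)) *
                   entropy([lab for x, lab in instances if x[attr] == v])
                   for v, n in counts.items())
    best = min(attributes, key=remainder)
    node = {}
    for value in {x[best] for x, _ in instances}:
        subset = [(x, lab) for x, lab in instances if x[best] == value]
        # the chosen attribute is removed from the list for the recursive call
        node[(best, value)] = build_tree(subset, [a for a in attributes if a != best])
    return node

# a tiny dataset mirroring the shape of Figure 2.1
data = [({"a1": "Low", "a2": True}, "C1"),
        ({"a1": "Normal", "a2": True}, "C2"),
        ({"a1": "Normal", "a2": False}, "C1"),
        ({"a1": "High", "a2": False}, "C2")]
tree = build_tree(data, ["a1", "a2"])
print(tree[("a1", "Low")])  # → C1
```

This sketch omits continuous-valued splits and pruning, both discussed in the text.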
If the training set has noise or outliers, the tree will grow branches that reflect these problems; tree pruning tries to identify and remove such branches. The most commonly used attribute selection measures are the following:
• Information Gain [23]
Information gain is based on the information content of messages. The attribute chosen to split is the one with the highest information gain. This attribute minimizes the information needed to classify the instances in the resulting partitions and maximizes the homogeneity of the class labels in those partitions. This approach produces simple trees and reduces the number of tests. Let D be the instance set. The information of D, called Gain(D), is defined as follows:

Gain(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)    (2.8)
where p_i is the probability that a random instance in D belongs to class C_i, estimated by |C_{i,D}|/|D|. A base-2 logarithm is used since the information is encoded in bits. Gain(D) is also known as the entropy of D.
If the attribute is discrete-valued with v distinct values, v branches will be grown. Ideally, we want each partition to contain only instances from the same class, but that is rarely achieved. Thus, we need to know how much more information is still needed in order to obtain an exact classification. We use the following expression for this purpose. Let D_j be the set of instances in partition j:

Gain_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \cdot info(D_j)    (2.9)

The term |D_j|/|D| represents the weight of partition j. Gain_A(D) is the expected information required to classify an instance from D based on this partitioning. Smaller values mean more homogeneous partitions with respect to the class label.
The information gain is given by the difference between the original information, based only on the class proportions, and the information obtained after the partitioning:

Gain(A) = Gain(D) - Gain_A(D)    (2.10)

For continuous-valued attributes, we have to determine the best split point, a threshold on the attribute, after sorting the attribute values in increasing order. In general, the midpoint between each pair of consecutive values is considered. The information gain attribute selection measure is used in the ID3 algorithm [47].
• Gain ratio [23]
C4.5 uses an extension of information gain, called the gain ratio, which is computed as follows:

GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}    (2.11)

where SplitInfo_A represents the potential information generated by splitting the training set:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)    (2.12)

In this setting, the attribute with the highest gain ratio is chosen. C4.5 thus applies a kind of normalization to the information gain, using the split information value.
• Gini index [23]
The Gini index measures the impurity of a set of instances using the following expression:

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2    (2.13)

where p_i represents the probability that an instance of D belongs to class C_i, estimated as |C_{i,D}|/|D|. CART uses the Gini index and considers a binary split for each attribute. In order to find the attribute with the best binary split, in the discrete-valued case we have to analyse all the possible subsets that can be formed using the known values of the attribute. Each such subset can be considered as a binary test for the attribute.
In the case of two partitions, D_1 and D_2, Gini_A is given by:

Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)    (2.14)

For continuous-valued attributes, each possible binary split is analysed. The procedure is similar to that of information gain, where the midpoint of each sorted pair of values is taken as a possible split; the point giving the minimum Gini index value is chosen as the split point. The reduction in impurity obtained by performing a binary split on attribute A is given by:

\Delta Gini(A) = Gini(D) - Gini_A(D)    (2.15)
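The three selection measures above can be illustrated on a small label set (a minimal sketch, independent of any particular tree implementation):

```python
import math
from collections import Counter

def info(labels):
    # Gain(D), equation (2.8): entropy of a label list
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_after_split(partitions):
    # Gain_A(D), equation (2.9): weighted information of the partitions
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * info(p) for p in partitions)

def split_info(partitions):
    # SplitInfo_A(D), equation (2.12): information of the partition sizes
    n = sum(len(p) for p in partitions)
    return -sum(len(p) / n * math.log2(len(p) / n) for p in partitions)

def gini(labels):
    # Gini(D), equation (2.13): 1 minus the sum of squared class probabilities
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

mixed = ["MCI", "MCI", "AD", "AD"]
parts = [["MCI", "MCI"], ["AD", "AD"]]         # a pure binary split
gain = info(mixed) - info_after_split(parts)   # equation (2.10)
print(gain)                                    # 1.0: the split removes all uncertainty
print(gain / split_info(parts))                # gain ratio, equation (2.11)
print(gini(mixed))                             # 0.5 for a perfectly mixed binary set
```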
Neural Networks
As in [23], a neural network is a set of connected input/output units, or neurodes, where each connection has an associated weight, as shown in Figure 2.2. Neural networks are computational analogues of biological neurons. Given a neurode j in a hidden or output layer, the net input I_j to the neurode is:

I_j = \sum_{i} w_{ij} O_i + \theta_j    (2.16)

where w_{ij} is the weight of the connection from neurode i to neurode j; O_i is the output of neurode i in the previous layer; and \theta_j is the bias of the neurode, which allows its activity to be shifted. Each neurode in the hidden or output layers applies an activation function to its net input. This function is a non-linear, differentiable function (typically logistic) that allows the classification of problems that are not linearly separable [23].
A neural network is simply a set of neurodes organized in layers: an input layer, one or more hidden layers, and an output layer. The neurodes in the input layer are called input neurodes. The inputs to the neural network correspond to the attributes measured for each training instance. The input instance is fed simultaneously into the neurodes of the input layer. These inputs pass through the input layer and are then weighted and fed as input to the second layer, the first hidden layer. The outputs of one hidden layer can be the input of the next hidden layer, and so on. The weighted outputs of the last hidden layer are the inputs of the output layer neurodes, which emit the network's prediction for a given instance.
A network is called feed-forward if none of the weights cycles back to a hidden neurode or to an output neurode of a previous layer. The network is fully connected if each neurode provides
Figure 2.2: A neurode of a hidden or output layer. The inputs to the neurode are the outputs of the previous layer. These are multiplied by their respective weights to form a weighted sum, to which the neurode bias is added. A non-linear activation function is then applied to produce the neurode's output. If the hidden layer is the first one, its inputs correspond to the input instance.
an input to all neurodes in the next layer. Such a network can model the class prediction as a non-linear combination of the inputs, that is, as a non-linear regression. Given enough hidden neurodes and enough training instances, a neural network can closely approximate any function [23].
The network topology is defined by the user, that is, the number of hidden layers, the number of neurodes in each hidden layer, and the number of input and output neurodes. The choice of these values is usually a trial-and-error process and may affect the accuracy of the final model. Neural networks can be used for classification (predicting the instance label) or for prediction of a continuous-valued output. For classification, one output neurode may be used to represent two classes (where 1 represents one class and 0 the other). If the problem has more than two classes, one output neurode is used per class [23].
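The forward pass through such a network (equation (2.16) followed by an activation function at each neurode) can be sketched as follows; the topology and weight values below are illustrative, not a trained model:

```python
import math

def sigmoid(x):
    # logistic activation function
    return 1.0 / (1.0 + math.exp(-x))

def neurode_output(inputs, weights, bias):
    # equation (2.16): net input I_j = sum(w_ij * O_i) + theta_j, then activation
    net = sum(w * o for w, o in zip(weights, inputs)) + bias
    return sigmoid(net)

def forward(instance, layers):
    # layers: list of layers, each a list of (weights, bias) pairs, one per neurode
    outputs = instance
    for layer in layers:
        outputs = [neurode_output(outputs, w, b) for w, b in layer]
    return outputs

# a 2-2-1 fully connected feed-forward topology with made-up weights
layers = [
    [([0.5, -0.4], 0.1), ([0.3, 0.8], -0.2)],   # hidden layer: 2 neurodes
    [([1.0, -1.0], 0.0)],                       # output layer: 1 neurode
]
prediction = forward([0.2, 0.7], layers)
print(prediction)  # a single value in (0, 1), interpretable as a class score
```

Training, i.e. adjusting the weights and biases, is done by backpropagation, described next.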
Backpropagation is the most common neural network learning algorithm. Backpropagation learns by iteratively comparing the predicted output for each training instance with the real value, which may be a class label in the case of classification or a continuous value in the case of prediction. The weights (w_{ij}, \theta_j) of the network are then adjusted to minimize the mean square error (MSE) between the network's prediction and the actual target value of the instance (the MSE is the most common error metric, but others exist). These adjustments are made by computing the derivative of the error with respect to each weight. The learning process stops when the weights converge. The backpropagation algorithm can, in a very simple way, be divided into two phases: propagation and weight update. After receiving the input parameters, the response of a unit is propagated as input to the neurodes in the next layer until the
Figure 2.3: Multilayer feed-forward neural network. A multilayer feed-forward neural network is aset of layers: one input layer, one or more hidden layers and an output layer.
output layer, where the response of the network is obtained and the error is computed as:

Err_j = (O_j - T_j)^2    (2.17)

where O_j is the observed output of neurode j and T_j is the known target value for the given training instance. The error is then propagated backwards, from the output layer to the first hidden layer, and the synaptic weights are adjusted along the way.
A disadvantage of neural networks, besides the generally long training time, is their poor interpretability. It is difficult for humans to interpret the symbolic meaning behind the learned weights and the hidden units of the network. Advantages of neural networks include their high tolerance to noise in the data, their ability to find patterns they have not been trained on, their ease of use when little is known about the relation between attributes and classes, and their suitability for continuous-valued inputs and outputs, in contrast to decision trees [23].
Support vector machines
Support vector machines (SVMs) are a method for linear classification of data, as shown in Figure 2.4. Non-linear classification can nevertheless be achieved by applying a non-linear kernel to the data; this transforms the data into a higher-dimensional space where linear classification can be applied, which results in non-linear classification in the original space. Briefly, an SVM works as follows [23]: it uses a non-linear mapping φ() to transform the original training data into a higher dimension, Y = φ(X). In the new space it searches for the optimal linear hyperplane that separates the classes. In the SVM sense, the optimal hyperplane (W) is the one that maximizes the margin (distance) between the two classes (in the transformed space), as shown in Figure 2.4.
Once the optimal hyperplane is found, the classification between the two classes can be achieved by computing:

d(Y) = sign(W \cdot Y + b)    (2.18)

To find the optimal hyperplane, a linear combination of training points can be used:

W = \sum_{i} \alpha_i C_i Y_i    (2.19)

where C_i = \{1, -1\} indicates the true class of instance Y_i = φ(X_i), and \alpha_i is a coefficient indicating how difficult it is to classify instance X_i. Using (2.19), one can rewrite the decision function as:

d(Y) = sign\left(\sum_{i} \alpha_i C_i Y_i \cdot Y + b\right)    (2.20)
In the non-linear classification case, one can define a non-linear transformation kernel K(X_1, X_2) = φ(X_1) \cdot φ(X_2). Typical kernels are:
• linear: K(X_i, X_j) = X_i^T X_j.
• polynomial: K(X_i, X_j) = (\gamma X_i^T X_j + r)^d, \gamma > 0.
• radial basis function (RBF): K(X_i, X_j) = \exp(-\gamma ||X_i - X_j||^2), \gamma > 0.
• sigmoid: K(X_i, X_j) = \tanh(\gamma X_i^T X_j + r).
where \gamma, r and d are kernel parameters. In practice, since the transformation function φ() is not always known, the decision is performed directly in the original space through the kernel function K():

d(X_j) = sign\left(\sum_{i} \alpha_i C_i K(X_i, X_j) + b\right)    (2.21)

where b is a parameter computed through the support vectors (SVs), i.e., all instances X_i in the training set that have a corresponding value \alpha_i > 0:

b = \frac{1}{N_{SVs}} \sum_{i \in SV} \left( \alpha_i K(X_i, X_j) - C_i \right)    (2.22)

where N_{SVs} is the number of support vectors.
To obtain the optimal hyperplane, an optimization problem can be derived [23]:

maximize

L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j C_i C_j K(X_i, X_j)    (2.23)

subject to

\forall_i \; \alpha_i \geq 0    (2.24)
\sum_{i} \alpha_i C_i = 0    (2.25)

This is a quadratic optimization problem with linear constraints and can be solved by standard constrained optimization algorithms. For some transformation functions, it may not be possible to obtain a hyperplane that exactly separates all training instances. In such cases, slack variables (ξ) may be added to the optimization function. Two formulations typically arise:
• Box constraints

W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j C_i C_j K(X_i, X_j)    (2.26a)
0 \leq \alpha_i \leq C    (2.26b)
\sum_{i} \alpha_i C_i = 0    (2.26c)
• Diagonal

W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j C_i C_j K(X_i, X_j) - \frac{1}{2C} \sum_{i} \alpha_i^2    (2.27a)
0 \leq \alpha_i    (2.27b)
\sum_{i} \alpha_i C_i \geq 0    (2.27c)
Binary classification is the standard SVM technique; multi-class classification is more complicated. One approach to the multi-class classification problem is to combine several binary SVM classifiers [16]. Example strategies are: the one-versus-all method using a winner-takes-all strategy (WTA SVM) [16]; the one-versus-one method implemented by max-wins voting (MWV SVM) [16]; DAGSVM [44]; and error-correcting codes [15].
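The RBF kernel and the decision function in (2.21) can be sketched as follows; the support vectors and \alpha values below are hypothetical, not the result of actually solving (2.23)-(2.25):

```python
import math

def rbf_kernel(xi, xj, gamma=1.0):
    # radial basis function kernel: exp(-gamma * ||xi - xj||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def svm_decide(x, support_vectors, kernel, b=0.0):
    # equation (2.21): sign of the kernelized weighted sum over the support vectors
    s = sum(alpha * c * kernel(xi, x) for alpha, c, xi in support_vectors) + b
    return 1 if s >= 0 else -1

# two made-up support vectors, each given as (alpha, class in {1, -1}, vector)
svs = [(0.5, 1, (1.0, 1.0)), (0.5, -1, (3.0, 3.0))]
print(svm_decide((1.2, 0.9), svs, rbf_kernel))  # → 1 (near the positive SV)
print(svm_decide((3.1, 2.9), svs, rbf_kernel))  # → -1 (near the negative SV)
```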
2.3 Feature Selection
A central problem in machine learning is identifying the set of features that best represents the data and can be used to construct a classification model for a particular problem [21]. Feature selection is the process of selecting a subset of features for building robust learning models. It is particularly important when the dataset has a large number of features. By removing redundant and irrelevant features from the dataset, feature selection aims to improve the overall performance of the learning models. This improvement results from reducing the curse of dimensionality [21], increasing the generalization capability of the learning models (and decreasing overfitting), speeding up the learning process, and improving model interpretability, since fewer features are used.
Feature selection algorithms are generally divided into two groups: feature ranking and subset selection. Feature ranking methods rank the features using a metric and remove all features that do not achieve a sufficiently high score. Subset selection methods search the space of possible feature subsets for the optimal one.
The optimal solution for supervised learning is an exhaustive search in which all feature subsets are tested and the best one is chosen. This is impractical when the dataset has a large number of features and instances. For practical learning problems, suboptimal solutions are usually preferred, since an exhaustive search would take too long to compute.
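As a minimal illustration of the feature ranking group, assuming numerical features scored by the absolute Pearson correlation with a binary class label (the scoring metric and data are chosen here for illustration only):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length value lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(X, y, keep=2):
    # score each feature by |correlation| with the class and keep the top ones
    scores = [(abs(pearson([row[j] for row in X], y)), j) for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:keep]]

X = [[1.0, 5.0, 0.1], [2.0, 5.1, 0.9], [3.0, 4.9, 0.2], [4.0, 5.0, 0.8]]
y = [0, 0, 1, 1]
print(rank_features(X, y, keep=1))  # → [0]: feature 0 tracks the class most closely
```

Note that such univariate ranking ignores redundancy between features, which is precisely what the CFS and mRMR methods below address.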
2.3.1 Correlation-based
Hall [21] claims that "feature selection for classification tasks in machine learning can be accomplished on the basis of correlation between features, and that such a feature selection procedure can be beneficial to common machine learning algorithms". The author compares Correlation-based Feature Selection (CFS) with wrapper methods and empirically concludes that CFS generally outperforms them. He also concludes that CFS is faster, often by two orders of magnitude, than wrapper methods.
2.3.2 Minimum redundancy maximum relevance (mRMR)
Peng et al. [43] proposed a feature selection method based on mutual-information techniques.
The authors study how to select good features according to the maximal statistical dependency
criterion based on mutual information. The method effectively combines mRMR with wrapper
methods to achieve a very compact subset of features from candidate features at lower com-
putational expense. The authors conclude experimentally that the use of mRMR based feature
selection can significantly improve the accuracy of the classification models.
2.4 Overcoming class imbalance
Imbalanced learning targets a significant number of problems of interest to academia, industry and government agencies [24]. The main problem of learning from imbalanced data sets is that the imbalance compromises the performance of most standard learning algorithms, since the majority of them assume a balanced class distribution or equal misclassification costs. The typical result is a bias towards the predominant class, which yields poor predictions for the minority class. The challenge is described in [24] as creating a highly accurate model for the minority class without severely damaging the accuracy of the majority class (note that in many cases the minority class is the most relevant one). In these problems, the evaluation of the classifier is not easy, since one class is highly under-represented. The conventional practice of using a single assessment criterion, such as the overall accuracy or the error rate, does not provide an adequate way to evaluate the model on imbalanced datasets. For example, for a data set with a ratio of 1000:1, if every single instance is classified as the majority class, the classification error would still be only 0.1%. More informative assessment metrics, such as ROC curves, precision-recall curves and cost curves, are needed to evaluate the performance of a classifier in the presence of an imbalanced dataset. This problem arises in different domains, such as biomedicine, fraud detection and network intrusion detection. Imbalances can be characterized as intrinsic, when the nature of the data space itself results in an imbalanced class distribution. Variable factors such as time and storage can also originate imbalanced data sets; we call this type of imbalance extrinsic, i.e., the imbalance is not directly related to the nature of the data space, but results from such external factors. It is of great importance to understand the
difference between relative imbalance and imbalance due to rare instances. For example, with a ratio of 100:1 in a dataset with 100,000 instances, 1,000 of those belong to the minority class. If we double the number of instances, the distribution does not change: there are now 2,000 minority instances, so the minority class is not necessarily rare in absolute terms, but merely small in relation to the majority class. This is an example of relative imbalance. The study of He et al. [24] concluded that in certain relatively imbalanced data sets, the minority class is accurately learned with little disturbance from the imbalance. This suggests that the imbalance is not the only factor that hinders learning; data complexity is the primary determining factor of the deterioration, which is then amplified by the relative imbalance. Data complexity is a term that covers issues such as overlapping classes, lack of representative data, small disjuncts and others. Imbalance due to rare instances, in contrast, is representative of domains where the minority class instances are genuinely very limited. In this kind of imbalance the learning process is more complex, due to the lack of representativity of the minority class. Minority concepts may also have sub-concepts with limited instances, which can make the classification difficult. This form of imbalance is called within-class imbalance, and concerns the distribution of the representative data over the sub-concepts inside a class.
2.4.1 Techniques to deal with imbalanced datasets
Class imbalance in the dataset can damage the quality of the classification. If the minority class is hard to discriminate, the classifier may simply classify every instance as the majority class. In that case, if the minority class represents only 1% of the data, the obtained accuracy would be 99%, although the classifier is useless for the discriminative task.
To overcome this problem, a set of methods has been developed with the objective of minimizing the effect of using imbalanced data. Examples of these techniques are: sampling, by removing, creating or duplicating instances; and cost-sensitive methods, where different costs are assigned to the misclassification of the different classes.
2.4.1.A Random Sampling
Two types of random sampling can be defined. Random under-sampling selects a small subset of instances from the majority class. Random over-sampling duplicates instances from the minority class in order to balance the dataset. Although the former can remove important instances from the dataset and the latter can lead to overfitting [9], sampling techniques have been constantly improving to overcome those issues. For example, random under-sampling can be used to remove instances that are far away from the decision border (using kNN to define the decision border) [9].
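A minimal sketch of the two random sampling strategies, assuming the instances are held in plain Python lists:

```python
import random

def random_undersample(majority, minority, seed=0):
    # keep only as many majority instances as there are minority instances
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def random_oversample(majority, minority, seed=0):
    # duplicate randomly chosen minority instances until the classes balance
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

majority = [("maj", i) for i in range(10)]
minority = [("min", i) for i in range(2)]
print(len(random_undersample(majority, minority)))  # → 4
print(len(random_oversample(majority, minority)))   # → 20
```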
2.4.1.B Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE is an over-sampling approach in which the minority class is over-sampled by creating synthetic instances rather than by over-sampling with replacement. The minority class is over-sampled by taking each minority class sample and introducing synthetic instances along the line segments joining any or all of its k nearest minority class neighbours. Depending on the amount of over-sampling needed, neighbours from the k nearest neighbours are chosen. For example, with k = 5 and 200% over-sampling, only two of the five nearest neighbours are chosen, and one instance is created in the direction of each. Synthetic instances are created in the following way: 1) take the difference between the feature vector (instance) under consideration and its nearest neighbour; 2) multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This selects a random point along the line segment between two specific instances, and effectively forces the decision region of the minority class to become more general [10].
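The two SMOTE steps can be sketched as follows, assuming numeric feature vectors in plain lists (a simplified illustration: neighbours are searched among all minority instances, and one synthetic instance is drawn per iteration):

```python
import random

def smote_instance(x, neighbour, rng):
    # steps 1-2: difference times one random number in [0, 1], added to x
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]

def smote(minority, k=2, n_new=3, seed=0):
    # for each synthetic instance: pick a minority point, find its k nearest
    # minority neighbours, then interpolate towards one of them
    rng = random.Random(seed)
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: sqdist(x, m))[:k]
        synthetic.append(smote_instance(x, rng.choice(neighbours), rng))
    return synthetic

minority = [[1.0, 1.0], [1.5, 1.2], [0.8, 1.4]]
new = smote(minority, k=2, n_new=3)
print(len(new))  # → 3 synthetic minority instances
```

Each synthetic point lies on a segment between two real minority instances, which is what makes the minority decision region more general.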
2.4.1.C Cost-sensitive classifiers
Cost-sensitive learning is used when different misclassification errors incur different penalties. In cost-sensitive learning, a cost matrix must be defined, giving a cost to each classification case (true positives, false positives, false negatives and true negatives). Taking the cost matrix into account, the classification model is trained to have the lowest cumulative cost [17]. The cost can reflect, for instance, economic loss, time, or the severity of an illness. Cost-sensitive classification can be applied to imbalanced datasets by assigning a higher cost to misclassifying the minority class. This gives more weight to the minority class and improves its discrimination.
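A sketch of the cost-sensitive decision rule, choosing the class with the lowest expected cost under a hypothetical cost matrix (the probabilities and costs below are made up):

```python
def expected_cost(probs, costs):
    # probs[i]: estimated probability that the instance belongs to class i
    # costs[i][j]: cost of predicting class j when the true class is i
    n = len(probs)
    return [sum(probs[i] * costs[i][j] for i in range(n)) for j in range(n)]

def cost_sensitive_predict(probs, costs):
    # choose the class with the lowest expected misclassification cost
    ec = expected_cost(probs, costs)
    return min(range(len(ec)), key=ec.__getitem__)

# class 0 = majority, class 1 = minority; missing the minority class costs 10x
costs = [[0, 1],
         [10, 0]]
print(cost_sensitive_predict([0.8, 0.2], costs))  # → 1 despite the lower probability
```

With equal misclassification costs the same rule reduces to picking the most probable class.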
2.5 Model Validation
For a specific classification problem, one can ask which method is the best. Using the training set to make this comparison would lead to misleadingly optimistic results, due to an over-specialisation of the learning algorithm to the data. To avoid this problem, the dataset is divided into a training set and a test set, and the accuracy of a model is evaluated by comparing the results on the test set. The confusion matrix is a helpful tool to analyse how well the classifier recognizes the instances of the different classes. A confusion matrix for a two-class problem is shown in Table 2.1.

Table 2.1: Confusion matrix.

                         Predicted C1                      Predicted C2
Actual C1    Number of true positives (TP)     Number of false negatives (FN)
Actual C2    Number of false positives (FP)    Number of true negatives (TN)

True positives are the positive instances that are correctly classified, while true negatives are the negative instances correctly classified. False positives are negative instances incorrectly classified as positive, and false negatives are positive instances incorrectly classified as negative.

In the case of more than two classes, n confusion matrices are generated, one per class, taking that class as positive and its complement as negative. To calculate the quality metrics (accuracy, sensitivity, etc.), a weighted average is used. For example, to calculate the accuracy in a three-class problem (C1, C2, C3), we compute \sum_{i=1}^{3} (accuracy of C_i × number of instances labelled C_i) and divide it by the total number of instances. One may also use a single matrix with more than two classes; in that case, for each metric, we take the target class as positive and the sum of the other classes as its complement.

A technique used to evaluate the quality of the models is cross-validation. Cross-validation is a statistical technique that evaluates how well the model generalizes to an independent dataset; if the model overfits, this technique will reveal it. The basic form is k-fold cross-validation, in which the data is divided into k nearly equal portions and, in each of the k iterations, k − 1 portions are used for training and the remaining one for testing. For example, when k = 10, the data is divided into ten nearly equal portions; one portion is used as the test set and the remainder as the training set. Since every portion is used both as a test set and as part of the training set, the model is trained and tested 10 times, each time with a different test set. The most common choice in data mining is 10-fold cross-validation. A common way to evaluate a classifier on a small dataset is leave-one-out cross-validation, where k is the number of instances in the dataset.
• Accuracy

Accuracy is the ratio between the number of correctly classified instances and the total
number of instances.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2.28)
• Sensitivity (True Positive Rate or Recall) and Specificity (True Negative Rate)

Sensitivity is the proportion of instances classified as class x among all instances which
truly have class x.

Recall = Sensitivity = TP / (TP + FN)    (2.29)

Specificity, the True Negative Rate, is the proportion of instances correctly classified as
not belonging to class x among all instances which are not of class x.

Specificity = TN / (TN + FP)    (2.30)
• Precision

Precision is the probability of obtaining a relevant result among the subset declared as
positive (TP + FP):

Precision = TP / (TP + FP)    (2.31)
• F-Measure

The F-Measure is the harmonic mean of precision (P) and recall (R).

F = 2PR / (P + R)    (2.32)
• Receiver Operating Characteristic (ROC) curves and |TPR − FPR|

ROC curves are a useful tool to compare classifier models. The ROC curve shows the trade-off
between the true positive rate (TPR), or sensitivity, and the false positive rate (FPR) of a
given classifier model. The ROC curve is drawn in an FPR vs TPR space (which we designate the
ROC space). Each point (FPR, TPR) represents the model using a different classification
threshold. Increasing the classification threshold results in fewer false positives (and more
false negatives), corresponding to a right-to-left movement along the curve.
The area under the ROC curve is a common metric used to compare classifiers. For probabilistic
classifiers (those returning a classification probability), it is computed by integrating the
curve, since the probabilities can be used to vary the threshold and draw the curve. For a
non-probabilistic classifier, i.e., a classifier that returns only the predicted class, the
AUC is simply (TPR + TNR)/2, or (Sensitivity + Specificity)/2, commonly named average
accuracy.
The Area Under the Curve (AUC) is also a good metric for imbalanced learning [24], but it is
harder to compute than |TPR − FPR| in probabilistic models. The AUC does not directly account
for the discriminative power of the classifier, since a classifier with AUC = 0 has more
discriminative power than one with AUC = 0.5.
In a ROC space, the line TPR = FPR represents a random, or meaningless, classifier. A
classifier model m can then be represented in this space as m = (1 − Specificity,
Sensitivity). If m is at the point (0, 1), the classifier is perfect, classifying everything
correctly. On the other hand, if m = (1, 0), the classifier is always wrong; in this case, by
inverting the labels we obtain a perfect classifier. In general, we are interested in a
classification model whose point in the ROC space is as far as possible from the random line
(TPR = FPR). This distance gives us the discriminative score of the classifier. Computing the
normalized Euclidean distance from a point to the random line yields the expression
|TPR − FPR|, or |Sensitivity − (1 − Specificity)|. This value has a maximum of 1, representing
perfect discrimination between the classes, and a minimum of 0, representing a random
classifier, i.e., a classifier without discriminative power. This metric values both classes
equally. One advantage of this metric is that it is insensitive to class imbalance, in
opposition to accuracy, which overvalues the majority class. Its advantage over analysing
sensitivity and specificity separately is that it combines both in a single value, which is
simpler to analyse. For this reason, this metric is referred to in the literature as
informedness [45].
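The metrics above, including the |TPR − FPR| discriminative score, can all be computed directly from the four counts of the confusion matrix; a minimal sketch (the function name and dictionary layout are our own):

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the quality metrics of Equations 2.28-2.32 plus |TPR - FPR|
    from the counts of a two-class confusion matrix (Table 2.1)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. 2.28
    sensitivity = tp / (tp + fn)                          # Eq. 2.29 (recall, TPR)
    specificity = tn / (tn + fp)                          # Eq. 2.30 (TNR)
    precision = tp / (tp + fp)                            # Eq. 2.31
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. 2.32
    fpr = 1 - specificity
    informedness = abs(sensitivity - fpr)                 # |TPR - FPR|
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f_measure": f_measure, "informedness": informedness}

# Example counts: 50 positives (45 found) and 50 negatives (40 found).
m = classification_metrics(tp=45, fn=5, fp=10, tn=40)
# m["accuracy"] == 0.85, m["informedness"] ≈ 0.7
```

Note that a trivial majority-class predictor on imbalanced data can reach high accuracy while its informedness is 0, which is why |TPR − FPR| is used in this thesis.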
2.5.1 Overfitting
Overfitting occurs when the model captures too much detail, or noise, from the training set. As
a result, the performance of the model on the training instances increases while on the test set
(data never seen by the model) the performance becomes worse. In the case of decision trees this
problem can be minimized by pruning the tree, which transforms the model into a more general one.
Figure 2.5 shows a typical case where the test set error (dashed line) increases after the
critical point, even though the training set error (solid line) keeps decreasing. At this point,
overfitting may have occurred.
Figure 2.5: Overfitting (error versus model complexity). Training set (solid line) vs test set (dashed line).
Cross-validation
Cross-validation is used to mitigate the overfitting problem. By using k training sets instead of
only one, this technique evaluates the model more thoroughly and detects models that suffer from
overfitting, allowing us to tune the models to avoid it.
2.6 Related Work
The problem addressed in this work, predicting the conversion from MCI to AD, has already been
studied by different researchers, with different approaches and different data, such as
biomarkers, Positron Emission Tomography (PET) scans, among others. A particular case of such
work is the one by Maroco et al. [37], which uses a previous version of the data used in this
thesis.
The work of Clifford Jr. et al. [27] addresses the prediction of time-to-progression from MCI to AD.
It used 218 subjects with an MCI diagnosis and one or more follow-up assessments, identified
from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The subjects underwent lumbar
puncture or PIB PET while carrying an MCI diagnosis. This work used a set of methods to pursue
its objectives (knowing that the most accepted and validated biomarkers in AD fall into two
categories: imaging and CSF chemical analyses): Magnetic Resonance Imaging (MRI) methods to
extract hippocampal volumes and total intracranial volumes; amyloid imaging methods to extract
a global cortical PIB PET retention summary (combining pre-frontal, orbitofrontal, parietal,
temporal, anterior cingulate and posterior cingulate/precuneus values for each patient);
cerebrospinal fluid methods to quantify biomarker concentrations; and, finally, statistical
methods to relate hippocampal volumes, intracranial volumes, Aβ loads and the presence of the
APOE ε4 allele. Among the 218 patients with an MCI diagnosis, 89 progressed to dementia
in an average time of 1.7 years. Important results are that age and education did not differ
significantly between patients that progressed to dementia and patients that did not, although
women showed a higher occurrence in the progression group. Another finding is that the group
that progressed to dementia had a higher proportion of APOE ε4 carriers and slightly worse
scores on the MMSE and Clinical Dementia Rating Scale at baseline, in comparison with
non-progressors. Aβ load and MRI measures were shown to be highly significant for the
progression to AD.
Chapman et al. [8] predicted the conversion from MCI to AD using neuropsychological tests, both
individually and combined in multivariate ways, using essentially two layers of weighting: the
weighting applied by PCA [1] in reorganizing the neuropsychological test measures, via the
correlations among them, into the component structure; and the differential weighting of the
component scores added by the discriminant analysis, in computing the discriminant coefficients
best able to differentiate between the conversion and stable groups. They studied 43 elderly
patients with an MCI diagnosis. The evaluation of these patients was done by physicians and
met current consensus criteria for the amnestic subtype of MCI. Some tests were performed by
memory-disorders physicians to assist with their diagnoses, including the MMSE, a clock face
drawing and the category fluency task. The median time of progression from MCI to a diagnosis
of AD was 19.4 months. In this work, all patients with evidence of stroke, Parkinson's disease,
HIV/AIDS, reversible dementias, patients medicated with certain drugs, and patients with a
score below a certain threshold on the MMSE test were excluded. The neuropsychological test
data were normalized to limit the influence of age, education and gender effects. Of the 43
patients, 14 were subsequently diagnosed with AD (conversion group) and 29 were not (stable
group). For the PCA [1] analysis more participants were added: 78 persons with normal
cognition (control group), 5 with age-associated memory impairment, and 35 MCI patients,
making a total of 216 participants. The PCA was used to develop the component structure from
the neuropsychological test battery. The main advantages of using PCA were organizing similar
test measures into components and reducing the number of variables while maintaining the
contributions of all measures. The statistical procedures (from the SAS software) used were
MULTTEST, FACTOR, STEPDISC and DISCRIM. Relevant results were that, in general, the conversion
group performed worse on the neuropsychological tests than the stable group, mainly in
retentive memory measures. The PCA on the neuropsychological measures yielded 13 components
shown to have a highly discriminatory power in separating AD from the normal ageing process.
These 13 components included a General Episodic Memory component, a Generative Fluency
component, a Speeded Executive Function component, a Mood/Activities of Daily Living component,
and other components representative of learning and recognition memory. From the 13 components
obtained with PCA, only the 11 most important were used, to maintain a 4:1 ratio between the 43
participants and the predictor variables. The 11 components accounted for 72% of the total
variance of the data. The PCA component scores were used to predict the conversion to AD. From
these 11 components, six were selected in the stepwise discriminant procedure for having the
best discriminatory power: Episodic Memory, Speeded Executive Function, Recognition Memory
(False Positives), Recognition Memory (True Positives), Speed in Visuospatial Memory and
Visuospatial Episodic Memory. The results of the prediction: 36/43 patients were correctly
classified (accuracy of 83.7%); of the 14 in the conversion group, 2 were incorrectly predicted
to have remained stable (sensitivity of 86%); of the stable group, 24/29 patients were
correctly predicted (specificity of 83%).
Ewers et al. [19] addressed the problem of predicting conversion from MCI to AD dementia based
on biomarkers and neuropsychological test performance. The data used consisted of MRI, CSF and
neuropsychological tests. As inclusion criteria they used the scores of neuropsychological
tests such as the MMSE, ADAS and Rey Auditory Verbal Learning Test; the concrete values are
given in [19]. From the CSF, the concentrations of Aβ1−42, t-tau and p-tau181 were extracted.
From the MRI, measures of hippocampus volume and entorhinal cortex were obtained. From the
neuropsychological test scores, the following set of tests was used: Rey Auditory Verbal
Learning Test (RAVLT), tests of frontal lobe functions, Trail Making Test A and B, verbal
fluency tests, Boston Naming Test and Digit Symbol Substitution Test. Statistical methods were
used to select some of the data. All variables were examined for normal distribution; some,
like some of the CSF concentration measures, were log-transformed to achieve a normal
distribution. The technique used was logistic regression analysis, to establish a prediction
model for differentiating AD from aged healthy controls. They created biomarker-only models and
tested whether the neuropsychological variables contributed to the predictive power of the
biomarker-based model. In the MCI group, time to conversion to AD was tested using Cox
regression analysis [D'Agostino et al.]. Within the MCI group, 58/130 developed AD within 3.3
years of clinical follow-up, with a mean follow-up interval of 2.3 years. This work created
several models: a model with biomarkers and neuropsychological variables combined, with 94.5%
classification accuracy, 93.8% sensitivity and 95.6% specificity; and a model with the LRTAA
formula (a logistic regression model based upon the CSF concentrations of t-tau and Aβ1−42 and
the number of APOE ε alleles) combined with neuropsychological tests, with an accuracy of
95.2%, sensitivity of 92.2% and specificity of 97.5%.
Hinrichs et al. [26] addressed the problem of predicting the progression from MCI to AD. The
data used in this work was taken from ADNI and consists of 233 subjects (48 AD, 66 healthy
controls and 119 MCI). The characteristics of the data are: age at baseline, gender, APOE
carrier status, MMSE at baseline, MMSE at 24 months, ADAS at baseline, years of education,
geriatric depression, and MR and FDG-PET images. In some cases other biological and
neurological data were available. The model used to analyse the data is based on the
multi-kernel learning (MKL) framework, which allows the addition of an arbitrary number of
views of the data in a maximum margin setting. The major innovation of MKL is that it learns an
optimal combination of kernel matrices while at the same time training a classifier. The
results obtained with 2-norm MKL are: using imaging modalities, 87.6% accuracy, 78.9%
sensitivity and 93.8% specificity; using biological measures, 70.4% accuracy, 58.1% sensitivity
and 79.4% specificity; using cognitive scores, 91.2% accuracy, 89.2% sensitivity and 92.6%
specificity; using all data combined, 92.4% accuracy, 86.7% sensitivity and 96.6% specificity.
Combining all data achieves the best classification, but using only the cognitive scores yields
results very close to those of all data combined.
Maroco et al. [37] is the most important related work, since it used an older version of the
dataset used in this thesis. The problem this work dealt with is the prediction of evolution
from MCI to AD. The data consist of a group of 921 elderly non-demented patients with cognitive
complaints. The inclusion criterion was a diagnosis of MCI and presence of one or more
follow-up neuropsychological assessments or clinical re-evaluations. The exclusion criteria
were: dementia or other disorders that cause cognitive impairment, medical treatments
interfering with cognitive function, and alcohol or illicit drug abuse. In each follow-up the
patient was classified as MCI or AD. The final dataset is composed of 400 patients. The
neuropsychological predictors used are a subset of tests from the Bateria de Lisboa para
Avaliacao de Demencia (Lisbon Test Battery for Dementia Evaluation) with a criterion validity
of p < 0.1. This study used 10 classifiers: LDA (Fisher's Linear Discriminant Analysis) [39],
QDA (Quadratic Discriminant Analysis) [39], LR (Logistic Regression) [41], MLP (Multilayer
Perceptron), SVM (Support Vector Machines), RBF (Radial Basis Functions), CART (Classification
and Regression Trees) [5], CHAID (Chi-squared Automatic Interaction Detector) [29], QUEST
(Quick Unbiased Efficient Statistical Tree) [34], and RF (Random Forests) [4]. All classifiers
performed better than chance in predicting the conversion to dementia of patients with MCI.
There was no statistical difference in total accuracy between 8 of the 10 classifiers, but RF
(mean = 0.74) and SVM (mean = 0.76) performed significantly better. However, these obtained a
poor performance in terms of sensitivity. Given this poor sensitivity, and the good sensitivity
required in this type of problem (conversion to dementia), LR, neural networks, SVM and CHAID
trees are inappropriate for this binary classification task. Taking accuracy, specificity and
sensitivity into consideration, Fisher's Linear Discriminant Analysis does not rank much lower
than computationally intensive methods like MLP or RF. In conclusion, this work shows that for
this problem RF and LDA have higher accuracy, sensitivity, specificity and discriminant power,
as opposed to SVM, neural networks and classification trees.
Table 2.2 synthesizes the related work referred to in this chapter, together with information
about the tests performed.
Table 2.2: Synthesis of the related work.

Work                 Problem                             Data                                              Methods
Clifford Jr. et al   Time to progression                 MRI, PIB PET and neuropsychological tests         Statistical methods
Ewers et al          Time to progression and prognosis   Neuropsychological tests and biomarkers           Statistical methods, logistic regression analysis
Chapman et al        Prognosis                           Neuropsychological tests                          PCA, statistical methods
Hinrichs et al       Prognosis                           Neuropsychological tests, FDG-PET and MR images   MKL (Multi-Kernel Learning)
Maroco et al         Prognosis                           Neuropsychological tests                          Fisher's Linear Discriminant, Quadratic Discriminant Analysis, Logistic Regression, MLP, SVM, Radial Basis Functions, CART, CHAID, QUEST and Random Forests
2.7 Summary
In this chapter we described Alzheimer's Disease, the tests used to evaluate a patient, in
particular the neuropsychological tests, and we analysed the related work. These tests are
organized in batteries that assess and monitor the patient's mental health. The
neuropsychological tests have several advantages over, for example, biomarkers, since they are
cheaper, more widely applied and non-invasive. But their use also has disadvantages, like the
human factor in evaluating subjective tests and the lack of quality assurance of the data.
These problems demand extra care in the analysis of the results, since this data, with its
great number of missing values and its imbalance, has a tendency to overfit the models. In this
chapter a variety of methods to train and test the models was also described.
3. Differentiating MCI from AD (Diagnosis)
Correctly differentiating MCI from AD is a key step in the process of predicting the conversion
from MCI to AD. In this context, this chapter addresses the diagnosis problem. Two feature
selection techniques are used, techniques for overcoming missing values are studied, and the
methodology described in this chapter is applied.
3.1 Formulation and Methodology
In this section the problem formulation and methodology are defined. By problem formulation we
mean the formulation of the problems that we aim to solve in this work, and by methodology the
methods that we use to solve the formulated problems.
3.1.1 Data Description
The Cognitive Complaints Cohort [35, 37] is a prospective study conducted at the Institute of
Molecular Medicine (IMM), Lisbon, to investigate the cognitive stability or evolution to dementia of
subjects with cognitive complaints based on a comprehensive neuropsychological evaluation and
other biomarkers. The criteria for inclusion, exclusion, and diagnosis of the participants as MCI or
AD during follow-up are described in detail in Dina et al. [51]. In this work, we used a revised and
augmented version of this dataset and considered only neuropsychological data.
The original dataset (Table 3.1) has 1641 instances consisting of individual evaluations of 950
distinct patients during their follow-up at IMM. In each evaluation, each patient was classified
by the medical doctors as Normal, preMCI, MCI or AD using clinical criteria. Only instances
labelled as MCI and AD were considered, and from these, only those concerning patients with at
least two evaluations were analysed, since instances corresponding to patients without follow-up
are more likely to be misclassified 1. All instances with a percentage of missing values of at
least 90% were removed, since these instances carry little information. This yielded a dataset
with 677 instances labelled as either MCI or AD, where each instance corresponds to a different
evaluation of one of 337 distinct patients. We note that, since we aim to distinguish between
MCI and AD patients (or, in the prognosis, the evolution from MCI to AD), we can consider each
evaluation of a patient as a different instance, meaning that we can learn from patients at
different disease stages: patients always diagnosed as MCI during follow-up and patients that
convert to AD during follow-up.
After excluding non-informative features, such as "Patient ID", and features related to the
patient's clinical history, such as "Follow-up Time", the analysed dataset (details in Table
3.2) is composed of 677 instances described by 157 features/attributes, which can be numerical,
categorical/nominal or ordinal. This dataset is highly imbalanced in the original classes,
since approximately 86% of the instances are labelled as MCI. Moreover, missing values, which
amount to around 50% of the overall data, are still an issue, as we discuss below. Figure 3.1
presents the histogram of missing values per feature, where it can be observed that 60% of all
features have more than 40% missing values.
Table 3.1: Original dataset details.

                          Normal        Pre-MCI     MCI            AD
Group size (%)            280 (17.1%)   63 (3.8%)   1147 (69.9%)   151 (9.2%)
Age (M±SD)                64.6±1.1      64.9±10     70.1±10.7      73.6±13.1
Sex (Female/Male)         201/78        32/31       679/468        98/52
Schooling years (M±SD)    10±0.2        11.2±5      8.5±4.9        8.7±5.2
Table 3.2: Dataset details after pre-processing.

                          MCI           AD
Group size (%)            583 (86.1%)   94 (13.9%)
Age (M±SD)                70 ± 8.4      73.3 ± 8.2
Sex (Female/Male)         352/231       61/33
Schooling years (M±SD)    8.7 ± 4.9     8.5 ± 5.2
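The pre-processing steps that reduce the original evaluations to the final dataset can be sketched as follows (the record layout and the tiny example cohort are illustrative assumptions; the real data has 157 features):

```python
from collections import Counter

def preprocess(evaluations, missing_threshold=0.9):
    """Keep only MCI/AD evaluations of patients with at least two
    evaluations, and drop instances with >= 90% missing values (None)."""
    # 1. Keep only instances labelled MCI or AD.
    kept = [e for e in evaluations if e["label"] in ("MCI", "AD")]
    # 2. Keep only patients with at least two evaluations (follow-up).
    counts = Counter(e["patient_id"] for e in kept)
    kept = [e for e in kept if counts[e["patient_id"]] >= 2]
    # 3. Remove instances whose fraction of missing values reaches the threshold.
    def missing_fraction(e):
        feats = e["features"]
        return sum(1 for v in feats if v is None) / len(feats)
    return [e for e in kept if missing_fraction(e) < missing_threshold]

# Tiny illustrative cohort (3 hypothetical features per evaluation).
evals = [
    {"patient_id": 1, "label": "MCI", "features": [3.0, None, 2.0]},
    {"patient_id": 1, "label": "AD", "features": [1.0, 1.5, None]},
    {"patient_id": 2, "label": "MCI", "features": [2.0, 2.0, 2.0]},      # no follow-up
    {"patient_id": 3, "label": "Normal", "features": [4.0, 4.0, 4.0]},   # wrong label
    {"patient_id": 4, "label": "MCI", "features": [None, None, None]},   # all missing
    {"patient_id": 4, "label": "AD", "features": [1.0, 1.0, 1.0]},
]
result = preprocess(evals)  # keeps patient 1 (both) and patient 4's second evaluation
```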
3.1.2 Problems
With the previously discussed data we can formulate three problems: the diagnosis, the
prognosis, and the time to conversion. Of these, only the time to conversion is not discussed in
1 It is less likely for patients with follow-up instances near the classification frontier to be misclassified with the wrong label.
Figure 3.1: Histogram of missing values per feature in the original dataset.
this work. The diagnosis is the problem of identifying the disease stage; in our case we aim at
differentiating MCI from AD patients, with the MCI class considered the positive class. This
differentiation is done using neuropsychological data obtained in clinical evaluations,
together with the medical appreciation of the patient at that specific point in time. The
relevance of this problem lies in helping the medical doctors identify the disease stage. It
plays an important role, since the diagnosis is the first step of the medical evaluation of the
patient, and a correct diagnosis helps the medical doctor to rapidly adjust the patient's care,
increasing the patient's quality of life.
The prognosis consists in predicting whether a patient will evolve from MCI to AD. For this we
use neuropsychological tests, performed by medical doctors in clinical evaluations; in this
problem the patient must have at least one follow-up. Determining whether a patient will evolve
to AD is of great importance to the medical doctors, since they can then apply preventive
treatment to grant the patient more quality of life and even increase life expectancy. This
problem, which is discussed in Chapter 5, is more difficult than the diagnosis, since we want
to predict the future evolution of the patient, and each patient is a unique human being with
different responses to the disease.
3.1.3 Methodology
A single data mining methodology can be used for all problems. This methodology includes six
classifiers: Naïve Bayes, Gaussian SVM, polynomial SVM, k-nearest neighbours, C4.5 decision
trees and artificial neural networks trained with backpropagation. All classifiers used are
implemented in WEKA.
The imbalance of the data is tackled with a synthetic oversampling technique (SMOTE) [10]. The
classifier parameters and the percentage of oversampling are determined using 10-fold
cross-validation in a grid search approach. The percentage of oversampling and the parameters
are searched jointly, since SMOTE changes the dataset, and thus the parameters found with
different SMOTE percentages may not be the same. The SMOTE algorithm is implemented in WEKA.
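The core idea of SMOTE, creating synthetic minority instances by interpolating between a minority example and one of its nearest minority neighbours, can be sketched in plain Python (this is a simplified version of the algorithm in [10], not the WEKA implementation used here):

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic new minority instances. Each one lies on the
    segment between a real minority instance and one of its k nearest
    minority neighbours (Euclidean distance), at a random position."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not base),
                            key=lambda m: dist2(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Oversample a toy 2-feature minority class by 200% (two synthetic per real).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=8)
```

Because the synthetic points are convex combinations of real minority instances, they stay inside the region occupied by that class rather than being mere duplicates.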
The high number of features, many of them highly correlated, is tackled using feature
selection. Feature selection reduces the effect of the curse of dimensionality, increases the
discriminative power, improves model generalization and reduces the effect of missing data. For
feature selection we applied two techniques: correlation-based feature selection [21] and mRMR
(minimum-redundancy-maximum-relevance) [43]. The correlation-based feature selection is
implemented in WEKA; mRMR was implemented in Matlab by a team member of the NEUROCLINOMICS
project.
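The greedy mRMR idea, repeatedly picking the feature with maximum relevance to the class minus mean redundancy with the already selected features, can be sketched as follows; for simplicity, absolute Pearson correlation stands in here for the mutual information used in [43], and the toy data is our own:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def mrmr_select(features, target, n_select):
    """features: dict name -> list of values; target: list of class values.
    Greedily select n_select features maximizing relevance minus redundancy."""
    relevance = {f: abs(pearson(v, target)) for f, v in features.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant
    while len(selected) < n_select:
        def score(f):
            redundancy = sum(abs(pearson(features[f], features[s]))
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        candidates = [f for f in features if f not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy example: f1 predicts the class, f2 duplicates f1 (redundant), f3 is noise.
target = [0, 0, 0, 1, 1, 1]
features = {
    "f1": [1.0, 1.1, 0.9, 3.0, 3.2, 2.9],
    "f2": [1.0, 1.1, 0.9, 3.0, 3.2, 2.9],   # identical to f1
    "f3": [5.0, 1.0, 4.0, 2.0, 5.0, 1.0],
}
chosen = mrmr_select(features, target, n_select=2)
# → ["f1", "f3"]: the duplicate f2 is skipped because of its redundancy
```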
Missing data is a significant problem in this dataset, since approximately 50% of the values
are missing. Its effect is reduced by the feature selection, since it removes highly incomplete
features, but the problem is also dealt with inside the classifiers in different ways. Some
methods have an internal way of handling missing data: the SVMs use median/mode imputation,
decision trees like the C4.5 algorithm use statistical methods that minimize the effect of
missing data, neural networks turn off the input neuron in case of a missing value, and kNN
assumes the maximum possible distance when values are missing.
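The kNN strategy mentioned above, assuming the maximum possible distance whenever a value is missing, can be sketched as a modified distance function (a simplification of that behaviour, assuming feature values normalized to [0, 1] so that the maximum per-feature difference is 1):

```python
def distance_with_missing(a, b):
    """Euclidean distance between two instances whose features are
    normalized to [0, 1]; None marks a missing value. If either value
    is missing, the worst-case difference of 1 is assumed."""
    total = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            d = 1.0  # maximum possible per-feature difference
        else:
            d = abs(x - y)
        total += d * d
    return total ** 0.5

# Identical instances, except one has a missing second feature:
print(distance_with_missing([0.2, 0.5, 0.9], [0.2, None, 0.9]))  # 1.0
```

This conservative choice means a missing value can never make two instances look artificially close.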
Model
Each classifier has a parameter or a set of parameters. The best SMOTE percentage found can
change with the problem, the subset of features, the classifier parameters or the classifier
itself. The best classifier parameters and SMOTE percentage must therefore be determined
jointly. Thus we cross all tested SMOTE percentages with all tested parameter sets in a grid
search. This search is done for each feature selection method, with a systematic process to
evaluate the parameter sets and SMOTE percentages over a defined space. An automated tool was
created to deal with this necessity, avoiding mistakes and optimizing the grid search process.
The metric used to compare the models for each parameter set and SMOTE percentage is
|TPR − FPR|, obtained by computing the normalized Euclidean distance from the point (FPR, TPR)
to the random line (FPR = TPR).
The grid search is performed for all classifier models and all datasets (with and without
feature selection); the parameter intervals are shown in Table 3.3. The tested SMOTE
percentages consist of 11 steps, from 0% (no oversampling) to the inversion of the imbalance.
All parameter sets and SMOTE percentages are tested with 10-fold cross-validation. After the
search, six best triples {Classifier, <Parameters>, SMOTE percentage} are found, one for each
classifier on a specific dataset with a feature selection method. Each triple is then tested
again in 30 repetitions, using a different seed for the 10-fold cross-validation in each
repetition. This allows us to perform a statistical analysis of the results. This parametrized
model is used to find the best parameter set and SMOTE percentage for all classifiers in each
dataset, with and without feature selection.
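The joint search amounts to iterating over the Cartesian product of parameter sets and SMOTE percentages and keeping the triple with the best |TPR − FPR|. A schematic sketch, where the evaluation function is a deterministic stub standing in for a real 10-fold cross-validation run (its formulas and the parameter names are illustrative only):

```python
from itertools import product

def grid_search(classifier, param_grid, smote_percentages, evaluate):
    """Return the triple (classifier, parameters, SMOTE %) maximizing
    |TPR - FPR|, crossing every parameter combination with every
    SMOTE percentage. `evaluate` must return (tpr, fpr)."""
    names = sorted(param_grid)
    best, best_score = None, -1.0
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        for pct in smote_percentages:
            tpr, fpr = evaluate(classifier, params, pct)
            score = abs(tpr - fpr)
            if score > best_score:
                best, best_score = (classifier, params, pct), score
    return best, best_score

# Stub evaluation: pretend TPR improves with SMOTE and FPR with C (toy values).
def fake_cv_evaluate(classifier, params, smote_pct):
    tpr = min(1.0, 0.5 + smote_pct / 1000)
    fpr = max(0.0, 0.4 - params["C"] / 50)
    return tpr, fpr

best, score = grid_search("RBF SVM",
                          {"C": [1, 5, 10], "gamma": [0.01, 0.1]},
                          smote_percentages=[0, 100, 200],
                          evaluate=fake_cv_evaluate)
# best == ("RBF SVM", {"C": 10, "gamma": 0.01}, 200)
```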
The SMOTE technique is applied only inside the cross-validation and only to the training set.
If this supervised technique were applied outside the cross-validation, or to the full dataset,
it would produce over-optimistic results: we would be testing the model on synthetic instances
created from the training set. The feature selection is performed outside the cross-validation.
This is done only because of the need to provide the medical doctors with the selected set of
features, and because using this technique inside the cross-validation would give us ten
feature sets. Since the feature selection is done outside the cross-validation, results that
use different feature selection methods are not directly comparable, because of the bias
introduced by applying a supervised method to all the data. Making a direct comparison would be
misleading, and would not represent the response of the model to new instances. To tackle this
problem, a final testing model was created, which is described below.
Table 3.3: Grid search parameter intervals.

Classifier     Parameters
Naïve Bayes    Gaussian, supervised discretization or kernel estimation
RBF SVM        Complexity ∈ [1, 10] and γ ∈ [10^-5, 10^2]
Poly SVM       Complexity ∈ [1, 10] and Degree ∈ [0.5, 5.0]
C4.5 DT        Confidence ∈ [0.05, 0.5]
ANN            Time ∈ [1000, 2000], Learning rate ∈ [0.1, 0.4] and Momentum ∈ [0.1, 0.3]
kNN            k ∈ [1, 10]
Figure 3.2: Data flow used in the parameter grid search for finding the classifier parameters: the data goes through feature selection, 10 non-overlapping folds are generated for cross-validation, SMOTE synthetic oversampling is applied to the training set before model training, and the testing set is used for model evaluation, producing the results. The SMOTE percentage is tested with 11 different values for each parameter combination.
Testing Model
For testing the obtained classification models a different dataset is used, obtained by
splitting the original dataset into 75% of the patients for training and 25% of the patients
for testing. For this we apply stratification based on: (i) number of evaluations; (ii) age;
(iii) sex; (iv) schooling years; and (v) class. The splitting of patients was therefore made
such that the distribution of the above variables is kept approximately constant in the
training and testing datasets.
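A simplified version of this patient-level split, stratifying here by class only whereas the thesis also stratifies by number of evaluations, age, sex and schooling years, could look like this:

```python
import random
from collections import defaultdict

def split_patients(patient_labels, train_fraction=0.75, seed=0):
    """Split patients (not individual evaluations) into train/test sets,
    keeping the class distribution approximately constant in both.
    patient_labels: dict patient_id -> class label."""
    by_class = defaultdict(list)
    for pid, label in patient_labels.items():
        by_class[label].append(pid)
    rng = random.Random(seed)
    train, test = [], []
    for label, pids in by_class.items():
        rng.shuffle(pids)
        cut = round(len(pids) * train_fraction)
        train.extend(pids[:cut])
        test.extend(pids[cut:])
    return train, test

# 8 MCI patients and 4 AD patients -> 6/2 MCI and 3/1 AD after the split.
labels = {i: "MCI" for i in range(8)}
labels.update({i: "AD" for i in range(8, 12)})
train, test = split_patients(labels)
```

Splitting by patient, rather than by evaluation, is what guarantees that no evaluation of a test patient ever reaches the training set.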
This allows the test set to be used in all problems: diagnosis, prognosis and any future
problem tackled in the NEUROCLINOMICS project. It should be noticed that the test set will not be used
Figure 3.3: Data flow used to simulate the real-world results: the training set goes through SMOTE synthetic oversampling and, with the found parameters, trains the classifier (model), which is then evaluated on the test set to produce the results.
to find the best parameter set. It is only used to evaluate the final models created using the
training set. These models use only the training set to find the best features and the best
parameters for that specific dataset. This allows us to analyse the behaviour of the trained
models in a "real world" simulation, since the model has never been in contact with any
instance of the test patients. Note also that the features and the parameters are selected
using only 75% of the data, avoiding overfitting in the feature selection and parameter grid
search. Such overfitting would compromise the generalization of the model.
Table 3.4: Details of the train set.

                          Normal        Pre-MCI      MCI           AD
Group size (%)            203 (16.6%)   42 (3.4%)    856 (69.8%)   125 (10.2%)
Age (M±SD)                65±0.9        63.8±10.1    69.9±10.1     73.4±14
Sex (Female/Male)         149/53        21/21        490/366       79/45
Schooling years (M±SD)    10.1±0.2      12.5±4.6     8.7±4.9       8.9±5.3

Table 3.5: Details of the test set.

                          Normal        Pre-MCI      MCI           DEM
Group size (%)            77 (18.6%)    21 (5.1%)    291 (70.1%)   26 (6.3%)
Age (M±SD)                63.5±0.6      67.1±9.7     70.8±12.4     74.7±6.8
Sex (Female/Male)         52/25         11/10        189/102       19/7
Schooling years (M±SD)    9.7±0.1       8.5±4.9      8±4.8         8±5
Tables 3.4 and 3.5 detail the obtained train and test sets. The stratification took into
consideration the number of evaluations of each patient in the raw dataset. As can be concluded
from the tables, the distribution of instances in the two sets is very similar.
3.2 Missing values
In this section, the aim is to study the missing values in a more systematic way, to find out
how they impact the classification results. For this we test a variety of strategies to deal
with missing values: median/mode imputation; median/mode imputation using only the patient's
previous instances; linear regression over the patient's evolution for the imputation of
missing values; and the use of a single value to denote a missing value.
Missing Minimization We use two strategies to reduce the number of missing values: in the first, we impute a value using the average of that feature over the patient's other evaluations; in the second, we use linear regression over those evaluations to determine a value. These strategies will not remove every missing value, but reduce their number significantly.
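The two minimization strategies above can be sketched as follows, for a single feature of a single patient; the function name and the representation of a missing value as `None` are illustrative assumptions, not the thesis implementation:

```python
def impute_from_other_evaluations(values, times=None):
    """Fill missing entries (None) in one patient's series of scores for a
    single feature, using the patient's other evaluations.

    Strategy 1: replace a missing value by the mean of the observed ones.
    Strategy 2 (if `times` is given): fit a least-squares line through the
    observed (time, value) pairs and evaluate it at the missing time.
    Entries stay None when the patient has no observed value at all."""
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    if not observed:
        return list(values)

    if times is None or len(observed) < 2:
        mean = sum(v for _, v in observed) / len(observed)
        return [mean if v is None else v for v in values]

    # Least-squares fit v = a*t + b over the observed evaluations.
    ts = [times[i] for i, _ in observed]
    vs = [v for _, v in observed]
    n = len(ts)
    t_mean, v_mean = sum(ts) / n, sum(vs) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    a = 0.0 if denom == 0 else sum((t - t_mean) * (v - v_mean)
                                   for t, v in zip(ts, vs)) / denom
    b = v_mean - a * t_mean
    return [a * times[i] + b if v is None else v
            for i, v in enumerate(values)]
```

For example, a patient scoring 10 and 14 at evaluations 0 and 2 gets a value of 12 imputed for a missing evaluation at time 1 under the linear-regression strategy.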
Random Assumption In the majority of the work done on missing value analysis [12], the assumption of random occurrence is made. In our data we know that this assumption is probably in part fallacious: a doctor may skip a test if the patient has a low score in another test, if the patient is simply too tired, because of time restrictions, and so on. Nevertheless, the main techniques were tested in order to observe whether this assumption can improve the overall classification over a set of classifiers.
Non-random Assumption Now we use the assumption that the missing values do not appear at random. In fact, the existence of a missing value may itself have discriminative power. The techniques to minimize missing values are not used, since the assumption is now that the missing values are purposeful. For this study all features become nominal, and some experiments use a discretized dataset created with a supervised algorithm [12]. To study this assumption, we replace each missing value with the value "MISSING", discretize the data, and then analyse whether there is some improvement in discriminative power.
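The non-random treatment can be sketched as follows; note that, as a simplification, unsupervised equal-frequency binning stands in for the supervised discretization algorithm actually used [12]:

```python
def to_nominal(values, n_bins=3, missing_token="MISSING"):
    """Turn a numerical feature into a nominal one, keeping missing values
    (None) as an explicit "MISSING" category.

    Equal-frequency binning is used here as a stand-in; the thesis uses a
    supervised discretization algorithm instead."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        return [missing_token] * len(values)
    # Equal-frequency cut points over the observed values.
    cuts = [observed[(k * len(observed)) // n_bins] for k in range(1, n_bins)]

    def bin_of(v):
        if v is None:
            return missing_token
        return "bin%d" % sum(v >= c for c in cuts)

    return [bin_of(v) for v in values]
```

The missing value thus becomes an ordinary category that the classifier can exploit, which is exactly the point of the non-random assumption.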
3.2.1 Experimental Setup
Using the knowledge acquired previously, such as the best feature selection technique and the oversampling percentage of the minority class (SMOTE), this setup uses four classification techniques: a linear SVM, an RBF SVM, a C4.5 decision tree and Naïve Bayes.
The configuration is:
• C4.5 Decision Tree [46] with a 0.25 confidence factor.
• Naïve Bayes classifier [28], assuming that each feature's probability density function (pdf) is Gaussian.
• Support Vector Machine (SVM) [30] using either a linear or a radial basis function (RBF) kernel (γ = 0.001). The complexity parameter, which defines the maximum weight of the support vectors, is C = 1 for the linear kernel and C = 2 for the RBF kernel.
As feature selection, the correlation-based subset method [21] is used. In the random assumption test, we combine the datasets resulting from using the average or the linear regression over the patient's evaluations with two techniques to remove the remaining missing values. These techniques, implemented in WEKA, are the replacement of missing values with the median or mode, and Expectation Maximization imputation. In the non-random assumption, the missing values were replaced by the string "MISSING". The numerical features are now nominal; for this we also used a supervised discretization algorithm to find the best discretization.
3.2.2 Results
Figure 3.4: The influence of replacing missing values with different techniques under the random assumption, using the metric |TPR − FPR|. The classification algorithms are NB (Naïve Bayes), SVM Linear, SVM RBF (Gaussian) and a C4.5 DT (decision tree). The imputation techniques are median and mode (for numerical and categorical features, respectively) and Expectation Maximization (EM). The missing minimization techniques are AVG (average) and LR (linear regression), both computed using the patient's other evaluations.
A comparative study was made to assess the influence of this assumption when replacing the missing values. In Figures 3.4 and 3.5, the comparison uses the |TPR − FPR| metric, since it reflects the trade-off between sensitivity and specificity. By analysing the results, the best technique to deal with the missing values, as expected, depends on the classification algorithm.
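The metric can be computed directly from a confusion matrix; a minimal sketch:

```python
def tpr_fpr_gap(tp, fn, fp, tn):
    """|TPR - FPR|: the absolute gap between the true-positive rate
    (sensitivity) and the false-positive rate (1 - specificity).
    A random classifier scores about 0 and a perfect one scores 1,
    so the metric stays meaningful on imbalanced data."""
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return abs(tpr - fpr)
```

For instance, a classifier with TPR = 0.8 and FPR = 0.1 scores 0.7, while one that labels everything positive (TPR = FPR = 1) scores 0, which is what makes the metric robust to class imbalance.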
The best results for each method are:
• In Naïve Bayes, with the non-random assumption, the discretized dataset shows a slightly better result than the original dataset; note that this dataset still contains the missing values. All datasets, however, yield a satisfactory result.
• In the Linear SVM, with the random assumption, using the original dataset also yields the best result. In this case, using linear regression to minimize the missing values gives a similar result.

Figure 3.5: The influence of replacing missing values with different techniques under the non-random assumption, using the metric |TPR − FPR|. The classification algorithms are NB (Naïve Bayes), SVM Linear, SVM RBF (Gaussian) and a C4.5 DT (decision tree), using a unique value to represent missing values and also a discretized dataset.
• In the RBF SVM, with the random assumption, the use of linear regression with median/mode, or of the average with median/mode, is marginally better than using the original dataset.
• In the C4.5 decision tree, the effects of dealing with the missing values are more evident. The best technique, under the random assumption, is the use of median/mode imputation to replace all missing values. Note that for this classifier the datasets without missing minimization are by far the best.
• For kNN and the Neural Networks the results are not shown, but in both cases the random assumption with median/mode imputation gives the best results.
3.2.3 Conclusions
The techniques that increased the discriminative power were those used under the randomness assumption. This does not imply that the missing values are random, only that, in the performed experiments, the techniques based on the randomness assumption gave better results. Perhaps by using other techniques, and by feeding neuropsychological domain information into the classifiers, the non-randomness assumption would produce the expected results. Nevertheless, the gain in discriminative power is not significant: the results obtained using each classifier's default way of dealing with missing values are very similar and in some cases even better.
3.3 Results and Discussion
For the diagnosis problem, six triples (classifier, parameter set, SMOTE percentage) have been selected for each feature set. The triples are shown in Table 3.6. These results were obtained with a grid search using only the training set. The box plots with the classification results on the training set are shown in Figure 3.6.
Figure 3.6: Train results of the diagnosis for the 3 feature sets. Each panel shows box plots of |TPR − FPR| for one classifier: kNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF and SVM Poly.
Additionally, for each classifier the following method is used to deal with missing values:

• In Naïve Bayes, the internal mechanism of ignoring the missing data is used.

• In the SVMs, the internal median/mode imputation is used.

• In kNN, median/mode imputation is used; the internal mechanism of assuming the maximal possible distance in missing cases is never used.

• In the C4.5 decision tree, median/mode imputation is used.

• In the Neural Networks, median/mode imputation is used.
Table 3.6 presents the best set of parameters found by the grid search. It should be noticed that, to balance the two classes, a synthetic oversampling of about 600% would have to be applied.
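SMOTE itself is applied via WEKA in this work; the underlying idea, interpolating between a minority instance and one of its nearest minority neighbours, can be sketched in a few lines (the function name and parameters are illustrative):

```python
import random

def smote(minority, percentage, k=5, seed=0):
    """Minimal SMOTE sketch: for each required synthetic sample, pick a
    minority instance, pick one of its k nearest minority neighbours, and
    interpolate at a random point on the segment between them.

    `minority` is a list of numeric feature vectors; `percentage` follows
    the SMOTE convention (e.g. 600 -> six synthetic samples per instance)."""
    rng = random.Random(seed)
    n_new = int(len(minority) * percentage / 100)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: dist2(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation point in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Because each synthetic point lies on a segment between two real minority instances, the oversampled class stays inside its original region of the feature space rather than being duplicated verbatim.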
Naïve Bayes

Naïve Bayes is, by design, insensitive to class imbalance, since the probabilities of belonging
Table 3.6: Diagnosis parameters, obtained using the training set. The number of features selected is 153 for All Features, 32 for Correlation and 22 for mRMR. The percentage of missing values is 45% for All Features, 13% for Correlation and 29% for mRMR.

Classifier          Feature Selection  SMOTE   Parameters
Naïve Bayes         All Features       0%      Supervised Discretization
                    Correlation        0%      Kernel
                    mRMR               1270%   Kernel
SVM RBF             All Features       635%    Compl = 2.5 and γ = 0.01
                    Correlation        381%    Compl = 4.0 and γ = 0.01
                    mRMR               508%    Compl = 2.0 and γ = 0.01
SVM Poly            All Features       1143%   Compl = 1.5 and Exp = 1
                    Correlation        1270%   Compl = 0.5 and Exp = 4
                    mRMR               1143%   Compl = 0.5 and Exp = 3
Neural Network      All Features       0%      l = 0.3, m = 0.2 and time = 2000
                    Correlation        0%      l = 0.3, m = 0.1 and time = 2000
                    mRMR               0%      l = 0.3, m = 0.1 and time = 1000
Decision Tree C4.5  All Features       508%    Conf = 0.05
                    Correlation        1143%   Conf = 0.05
                    mRMR               1016%   Conf = 0.05
kNN                 All Features       508%    k = 9
                    Correlation        635%    k = 5
                    mRMR               1143%   k = 8
to a class are calculated using only that class, and the most probable class is then chosen. In this case, oversampling is only used with the mRMR feature set, which allows the classifier to overcome some confusion near the decision frontier. The numerical values for the full feature set are handled with supervised discretization; for the other feature sets, the best way to deal with them is kernel density estimation of the probability density function.
SVM
The Gaussian SVM (SVM RBF) is considerably sensitive to class imbalance. For this reason, the grid search always chooses a synthetic oversampling percentage that nearly balances the classes. Remarkably, for the polynomial SVM, the grid search leads to an inversion of the class distribution. However, on average, considering only the training set, the results using a Gaussian kernel are better than those using a polynomial kernel.
For the polynomial SVM (SVM Poly), the SMOTE applied inverts the imbalance of the data in all three cases. The AD class becomes overrepresented, which shows that this model prefers an overrepresented AD class. As a side effect, this reduces the confusion at the class borders, since they are now overpopulated with AD instances. Overpopulating with AD instances increases the probability of correctly classifying the AD instances at the border; consequently, the MCI instances at this border suffer more misclassification. This increase in misclassification is more acceptable than misclassifying AD, since the dataset has more MCI instances than AD instances. The complexity found is relatively small, between 0.5 and 1.5. With more features (the original case) a polynomial degree of 1 suffices, but the smaller feature sets use a higher degree.
Neural Networks
For the Artificial Neural Networks (Neural Networks), it can be observed that SMOTE is always kept at 0%. This shows that artificial neural networks, in all feature sets tested, are not sensitive to the class imbalance. Nevertheless, the median |TPR − FPR| is generally worse in all feature sets. For the diagnosis case, and taking into account the mean |TPR − FPR|, the Neural Networks are the worst model tested.
Decision Tree
For the C4.5 decision tree, the SMOTE percentage selected using all features almost leads to the balanced state, but for correlation and mRMR the chosen oversampling inverts the balance of the classes. This shows that, for the reduced feature sets, this classifier prefers to have the AD class overrepresented. The confidence factor chosen is 0.05 in all feature sets; lowering the confidence factor increases the pruning of the tree, so the selected value shows that the model places little confidence in the dataset.
kNN
k-Nearest Neighbours (kNN) is also sensitive to class imbalance, so the selected SMOTE tends toward the balanced state, except in the mRMR case, where the best SMOTE inverted the data balance. Again, this can be explained by the need to define the classification frontiers by overpopulating them with instances of the least represented class. The number of neighbours chosen for the full set and the mRMR set is large, 9 and 8 respectively, which again indicates confusion in the classification. For the correlation-based set, the number of neighbours is 5, indicating a less confused dataset.
Statistically, we can compare the models that use the same feature set, using the training set. For this analysis we use paired t-tests, applied in an all-vs-all fashion. The t-tests are only applied if an ANOVA test with 95% confidence confirms the existence of a significant difference.
Using a paired all-vs-all approach, in each feature set:
• Original (All Features)
The SVM RBF shows, in all t-tests with 95% confidence, a significant difference, being the best model in all cases. The decision tree is in all cases the worst model, with 95% confidence.
• Correlation
For the correlation-based feature set, Naïve Bayes and the SVM RBF are the best models, with no statistically significant difference between them at 95% confidence. The Neural Networks are the worst model.
• mRMR
For the mRMR feature set, the SVM RBF has a significant difference to all other models and is in all cases the best one (at 95% confidence). Again, the Neural Networks are the worst model.
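The paired t statistic used in these comparisons can be sketched as follows; the ANOVA gate and the t-distribution lookup for the 95% threshold are left out:

```python
import math

def paired_t(a, b):
    """Paired t statistic for two matched samples of results (e.g. the
    same folds/seeds evaluated with two different models).
    Returns (t, degrees_of_freedom).

    The thesis only runs these tests when a one-way ANOVA at 95%
    confidence first signals a difference somewhere among the models."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1
```

The resulting t is then compared against the two-sided 95% critical value for n − 1 degrees of freedom; consistently positive differences across repetitions yield a large |t| even when the mean gap is small.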
Now, using the test set that was never seen before, it is possible to evaluate the models' behaviour in a "real world" environment. With these results we can compare models that use different feature sets, since the features were selected without the help of the test set. Figure 3.7 shows the test results; the scale of the plots is |TPR − FPR|.
Looking at the test results, we can see that 3 out of the 6 classification algorithms have, for one feature set, a result |TPR − FPR| > 0.6. These classifiers are: Naïve Bayes with correlation, Neural Networks with correlation, and kNN with all features. For the SVMs, the results using the different feature sets are all very similar (around |TPR − FPR| ≈ 0.5). For the C4.5 decision tree, the results vary significantly among feature sets; results with the original feature set are considerably worse than for the other models (|TPR − FPR| ≈ 0.2). For kNN, the best results are achieved using all features.
Analysing the results, we can see that the best feature set can change from classifier to classifier, which means that a single feature set is not always the best for all cases but varies from model to model. Table 3.7 shows, for the best classifiers, the most common metrics and the |TPR − FPR|.
Using the training set, the highest median value is |TPR − FPR| ≈ 0.6, obtained by the SVM RBF with correlation-based feature selection (the results with the other feature sets are very similar). However, on the test set this model only reaches |TPR − FPR| ≈ 0.5. The maximum value obtained over all algorithms is |TPR − FPR| ≈ 0.6; it appears in 2 models that use the correlation feature set, Naïve Bayes and the Neural Network. In this case, we can compare the result on the training set with the result on the test set (which simulates the real world with a fully independent sample) to analyse the consequence of using only the training set to pick the best model. The SVM RBF with correlation-based feature selection appears to have the best results; however, its generalization is not so good, since in contact with unknown instances its results drop.
Now we compare the other metrics, such as accuracy, sensitivity, specificity and area under the ROC curve, on the test set. Analysing the accuracy results, we can see two values above 90%, for Naïve Bayes and the Neural Networks. Note that those models have a high |TPR − FPR| score of about 0.62. But kNN, which also scores 0.62 in |TPR − FPR|, has an accuracy of only 78%; using accuracy alone, this model would be considered inferior and dismissed. The trade-off between sensitivity and specificity is what |TPR − FPR| captures. We can see that the Neural Network and Naïve Bayes have higher sensitivity, while kNN has the highest specificity but one of the lowest sensitivities. Using the AUC (ROC area), the three top scores coincide with the three top |TPR − FPR| scores; the AUC metric is also suitable for imbalanced data. It is not easy to choose the best model; however, the model with the highest |TPR − FPR| and area under the ROC is the Neural Network.
Table 3.8 shows the confusion matrix of the Neural Network model; we can see that the majority of the instances in each class are correctly classified.
Figure 3.7: Results for the diagnosis using an independent test set; the scale is |TPR − FPR|, where higher is better. Each panel shows one classifier: kNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF and SVM Poly.
Table 3.7: Best classifiers for the diagnosis problem, using the test set.

Classifier   FS           Accuracy  Sensitivity  Specificity  ROC Area  |TPR−FPR|
Naïve Bayes  Correlation  91%       93%          69%          0.85      0.62
SVM RBF      Correlation  76%       76%          76%          0.77      0.53
SVM Poly     Original     78%       79%          69%          0.74      0.48
NN           Correlation  92%       94%          69%          0.91      0.62
DT C4.5      Correlation  76%       76%          69%          0.76      0.45
kNN          Original     78%       78%          85%          0.86      0.62
3.4 Summary
Table 3.8: Confusion matrix for the diagnosis problem, using the test set. The algorithm used is the Artificial Neural Network with the correlation-based feature set.

           Predicted MCI  Predicted AD
Real MCI   134            9
Real AD    4              9

The missing values in the database have been taken into consideration. This problem was analysed under two assumptions: a random distribution and a non-random distribution. A set of experiments was performed using linear regression, expectation maximization, median/mode imputation, and a single value. The conclusion was that the missing values behave as if random, and that the results do not significantly improve by using those techniques. Other approaches could be taken to deal with the missing data, such as using unsupervised learning to impute missing values from clusters, but in this work those approaches were not tested.
To address the need of knowing the behaviour of the created models in a real environment, that is, their behaviour in contact with new instances, a completely independent test set was created from the raw dataset. This approach is common in data mining contests, where the training set is given to the competitors and the test set used to compare the submitted models is provided later. The metric used to compare the results is |TPR − FPR|, which, by taking into account random classification in the ROC space, gives us an unbiased metric. Using metrics like precision, F-measure, sensitivity, specificity or accuracy would give biased results as a consequence of the imbalanced data.
In this chapter, the diagnosis problem was addressed. For that, a methodology was defined to build models using the clinical data, taking into account the missing data and the class imbalance. In this problem we aim at differentiating MCI and AD in an imbalanced dataset, with high dimensionality and a high percentage of missing data. For this we defined and applied the described methodology. In the course of this work we found that, in contact with an independent test set, the models that show the highest generalization are Naïve Bayes, the Artificial Neural Network and kNN. Furthermore, a single feature subset is not always the best one; this allowed us to conclude that the best feature set depends on the algorithm used, and probably also on the parameters used. The best models found have |TPR − FPR| ≈ 0.6. This result shows that the diagnosis problem is indeed complex, but that we achieved a good discriminative model for the MCI and AD classes using state-of-the-art techniques. The best models are Naïve Bayes and the Neural Networks using correlation-based feature selection, and kNN using all features. Other metrics, in particular the area under the ROC, also indicate that the obtained models have a high discriminative power.
4. Predicting conversion from MCI to AD (Prognosis)
The prognosis of a patient is of great importance to medical doctors: it allows adequate medical care for the patient and support for the family. The prognosis of Alzheimer's disease (or another cognitive impairment) also plays a role in the patient's decisions about the future. For example, if the conversion to AD will occur within a year and the patient has a high-responsibility job, e.g. as a company manager or a pilot, the patient can adjust his life to minimize the impact of his disease on society.
4.1 Prognosis prediction approach
For prognosis prediction we use two different approaches. The first one, normally used in similar problems [8, 19, 25, 27], consists of determining whether a patient will ever convert to AD. This approach will be referred to in this work as First and Last Evaluation, since it looks at the first and last entry of the patient in the database to determine whether the patient evolved from MCI to AD. In this approach, each patient has only a single entry in the post-processed dataset.
The second approach looks at a given temporal window and tries to predict whether a patient converts from MCI (at the beginning of the temporal window) to AD (at the end). For this, and according to Figure 4.1, a new set of labels has been created: evolution (Evol) and no evolution (noEvol). The noEvol class is considered the positive class. Notice that any instance with insufficient knowledge about the outcome is removed in the process, since the behaviour of the disease inside the window is unknown.
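The labelling can be sketched as follows; the (day, diagnosis) encoding and the exact tie-breaking rules are illustrative assumptions, since only the graphical definition in Figure 4.1 is given:

```python
def window_label(evaluations, window_days):
    """Label one patient's first evaluation for a temporal window of
    `window_days`. `evaluations` is a time-ordered list of
    (day, diagnosis) pairs with diagnosis in {"MCI", "AD"}
    (a hypothetical encoding).

    - "Evol":   MCI at the start, and AD observed inside the window.
    - "noEvol": still MCI at or after the end of the window.
    - "UNK":    follow-up too short to know the outcome inside the
                window; such instances are removed from the dataset."""
    start_day, start_dx = evaluations[0]
    if start_dx != "MCI":
        return "UNK"
    end = start_day + window_days
    for day, dx in evaluations[1:]:
        if dx == "AD" and day <= end:
            return "Evol"
        if dx == "MCI" and day >= end:
            return "noEvol"
    return "UNK"
```

For instance, a patient seen as MCI at days 0 and 400 who is AD at day 800 is "Evol" for a 3-year window but "UNK" for a 2-year one, since the conversion may have happened after the 2-year window closed.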
To choose the temporal windows, two factors were considered: (i) the distribution of instances between classes (Evol/noEvol) and (ii) the medical relevance, obtained by consulting the medical partners of the NEUROCLINOMICS project. For the latter, a period of around 3 years was recommended. For the former, we extracted the class distribution as a function of the temporal window size (see Figure 4.2). By analysing the evolution of the labels as a function of window size, it can be observed that a temporal window of 3 years balances the classes Evol and noEvol. Thus, three temporal windows, one year apart, have been created: 2 years, 3 years and 4 years. For the first and last evaluation approach, after data pre-processing, the class distribution is 37% Evol and 63% noEvol for both the training and test sets, as presented in Table 4.1.
Figure 4.1: Graphical representation of the new class labels created for the temporal-window prognosis problem (MCI, DEM, Evol, noEvol, UNK-MCI-MCI and UNK-MCI-? over time).
Figure 4.2: Variation of the class labels (instances) with the size of the temporal window (days). These results were obtained using all data; only Evolution (Evol) and no Evolution (noEvol) are used.
4.2 Classification Model
The classification model used for the prognosis prediction is simpler than the one used for the diagnosis. Independent training and test sets have been created, and a grid search is applied to find the best model parameters on the training set. As in the diagnosis, the |TPR − FPR| metric, which considers the balance between specificity and sensitivity, was used to determine the best model. Three feature sets are again used: Original (i.e., all features), Correlation (obtained with correlation-based feature selection) and mRMR (obtained with the mRMR feature selection). Figure 4.3 shows the model used in the grid search to find the best parameters using the training set, and Figure 4.4 shows the model used for testing after the parameters have been found.
Figure 4.3: Data flow used in the parameter grid search for finding the classifier parameters. Feature selection and the generation of 10 non-overlapping folds for cross-validation precede model training and evaluation; the SMOTE percentage is tested with 11 different values for each parameter combination.
Figure 4.4: Data flow used to simulate the real-world results. The training set, after SMOTE synthetic oversampling, trains the classifier with the selected parameters; the results are then obtained on the test set.
To deal with the missing data problem, a study similar to the one made for the diagnosis was performed. The following methods are used to deal with missing values:

• In Naïve Bayes, the internal mechanism of ignoring the missing data is used, which simply excludes them from the calculations.

• In the SVMs, the internal median/mode imputation is used.

• In kNN, median/mode imputation is used, which gives better results than WEKA's default of considering the maximum distance between instances.

• In the C4.5 decision tree, median/mode imputation is used, since the tests showed that this method was better than the default statistical one.

• In the Neural Networks, median/mode imputation is used, since the tests showed that this method was better than turning off the input neuron when a value is missing.
4.3 Results and Discussion
4.3.1 First and Last Evaluation
Table 4.1: Demographics of the First and Last Evaluation dataset.

Groups                                        noEvol      Evol
Size                                          63%         37%
Age (Mean ± SD)                               68.6 ± 8.7  71.6 ± 8.1
Sex (Male/Female)                             89/124      41/82
Schooling years (Mean ± SD)                   8.7 ± 4.9   8.6 ± 5.0
Time between assessments (years) (Mean ± SD)  2.7 ± 2.3   2.5 ± 1.5
Table 4.2: First and Last Evaluation parameters, obtained using the training set. The number of features selected is 153 for All Features, 32 for Correlation and 22 for mRMR. The percentage of missing values is 43% for All Features, 18% for Correlation and 35% for mRMR.

Classifier          Feature Selection  SMOTE  Parameters
Naïve Bayes         All Features       0%     Supervised Discretization
                    Correlation        0%     Supervised Discretization
                    mRMR               150%   Gaussian
SVM RBF             All Features       50%    Compl = 4.5 and γ = 0.001
                    Correlation        0%     Compl = 1.5 and γ = 0.1
                    mRMR               50%    Compl = 4.0 and γ = 0.01
SVM Poly            All Features       500%   Compl = 2.0 and Exp = 1
                    Correlation        0%     Compl = 0.5 and Exp = 1
                    mRMR               250%   Compl = 1.0 and Exp = 1
Neural Network      All Features       0%     l = 0.3, m = 0.1 and time = 1000
                    Correlation        0%     l = 0.3, m = 0.2 and time = 1000
                    mRMR               0%     l = 0.3, m = 0.1 and time = 1000
Decision Tree C4.5  All Features       400%   Conf = 0.35
                    Correlation        400%   Conf = 0.5
                    mRMR               50%    Conf = 0.15
kNN                 All Features       350%   k = 4
                    Correlation        100%   k = 9
                    mRMR               350%   k = 10
Table 4.2 presents the parameters found by the grid search described in Section 4.2. The processed dataset has a minor imbalance, 63% noEvol vs 37% Evol (see Table 4.1). Nevertheless, the oversampling technique, SMOTE, was applied; the balanced state is obtained with approximately 70% oversampling of the minority class. For the Naïve Bayes classifier, oversampling was used only with the mRMR features; in this case, for example, oversampling inverts the class balance. Another, more extreme, example of this is observed when using the original set of features with the SVM Poly classifier: an oversampling of 500% was chosen, which turns the minority class into a majority class. With the decision trees and kNN we note that in some cases the oversampling used also completely inverts the class balance.
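The roughly 70% figure follows from simple arithmetic: to balance the classes, each minority instance needs (n_majority / n_minority − 1) synthetic copies. A small sketch:

```python
def balancing_smote_percentage(n_minority, n_majority):
    """SMOTE percentage that makes the minority class as large as the
    majority class: each minority instance gains
    (n_majority / n_minority - 1) synthetic copies."""
    return 100.0 * (n_majority - n_minority) / n_minority

# First-and-last dataset (63% noEvol vs 37% Evol):
#   balancing_smote_percentage(37, 63) -> ~70%
# Diagnosis train set (856 MCI vs 125 AD):
#   balancing_smote_percentage(125, 856) -> ~585%, consistent with the
#   roughly 600% mentioned for the diagnosis problem.
```

Percentages above this threshold, such as the 500% chosen for SVM Poly, therefore invert the class balance rather than merely correct it.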
Statistically, we can compare the models using 30 repetitions, obtained by running the models 30 times with 30 different seeds. For this analysis we used paired t-tests, applied in an all-vs-all fashion. The t-tests are only applied if an ANOVA test with a confidence level of 95% confirms the existence of a significant difference. Using paired t-tests, all vs all, with 95% confidence, in each feature set we observed:
• Original (All Features)
The Naïve Bayes and SVM RBF classifiers do not have a significant difference between them, but both differ significantly from all other models, always with a greater mean result. kNN has a statistically insignificant difference only to the Neural Network, having in all other cases a worse mean result.
• Correlation
Naïve Bayes has a significant difference to all other models, always with a greater mean result. The Neural Network model also has a significant difference, but with a worse mean result in all cases.
• mRMR
Naïve Bayes and the SVM RBF again have the best results, with a significant difference from all other models. The model with the worst result is kNN.
Table 4.3: Best classifiers for the First and Last problem, using the test set.

Classifier   FS           Accuracy  Sensitivity  Specificity  ROC Area  |TPR−FPR|
Naïve Bayes  Correlation  71%       72%          68%          0.67      0.40
SVM RBF      Original     67%       63%          77%          0.70      0.40
SVM Poly     Correlation  75%       84%          50%          0.67      0.34
NN           Original     66%       70%          55%          0.70      0.25
DT C4.5      mRMR         63%       70%          45%          0.59      0.16
kNN          Correlation  53%       46%          73%          0.59      0.18
Figure 4.5: Test results of the prognosis using First and Last Evaluations. Each panel shows |TPR − FPR| for one classifier: kNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF and SVM Poly.

Now looking at the test set results (Figure 4.5), we can observe that the overall results are disappointing, although they have a similar or slightly higher level of discriminative power compared to the ones
obtained by Maroco et al. [35, 37]. Note, however, that a true comparison cannot be made, since the dataset used in this work is slightly different. In this problem, we can see clearly that Naïve Bayes and the SVMs are the best algorithms for almost all feature sets. The best result is |TPR − FPR| ≈ 0.4, using Naïve Bayes with the original features (all features) and with the correlation-based features; with the SVM RBF the same result is obtained using the original feature set. We can see that trying to discriminate progression is not a simple task. Our results for this approach showed that in some cases we have some discriminative power, but we never obtained a model with more than approximately 0.4 in |TPR − FPR|: the models are closer to the random model than to the perfect one. Table 4.3 presents the results considering other metrics besides |TPR − FPR|. In all cases the area under the ROC is below 0.70 and the accuracy is never above 75%. These results agree with our previous conclusions using the |TPR − FPR| metric: the models obtained using this approach have little predictive power.
4.3.2 Temporal Windows
One of the problems for not achieving better results with the first and last approach may came
from the dataset itself. Since the follow up time differs from patient to patient, confusion may arise
from the fact that a patient has not yet evolved because insufficient time has passed. As we will
see, this problem is mitigated by the temporal windows approach leading to substantially better
results. In accordance with to the medical opinion, three temporal windows are considered: 2, 3
and 4 years.
Two Years Temporal Window
Using the two-years temporal window, we get a dataset with the characteristics described in Table 4.4. This dataset has a minor imbalance: 67% of the evaluations are classified as no evolution (noEvol) and 33% as evolution (Evol). It is interesting to observe that the evolution group is older than the no-evolution group.
Table 4.4: Dataset demographics after applying the pre-processing for the two years temporal window.

Groups (2 Years)              noEvol        Evol
Size                          181 (67%)     90 (33%)
Age (Mean ± SD)               68.8 ± 8.2    72.5 ± 8.2
Sex (Male/Female)             74/107        33/57
Schooling years (Mean ± SD)   8.8 ± 5.1     8.7 ± 4.8
Table 4.5: Classification model parameters for the prognosis in a temporal window of 2 years. The number of features selected is 153 for All Features, 29 for Correlation and 15 for mRMR. In this set the missing values are 44% for All Features, 11% for Correlation and 43% for mRMR.

Classifier           Feature Selection   SMOTE   Parameters
Naïve Bayes          All Features        216%    Supervised discretization
                     Correlation         108%    Gaussian
                     mRMR                486%    Gaussian
SVM RBF              All Features        54%     Compl = 5.0 and γ = 0.01
                     Correlation         54%     Compl = 2.0 and γ = 0.1
                     mRMR                162%    Compl = 0.5 and γ = 1.0
SVM Poly             All Features        432%    Compl = 1.0 and Exp = 1
                     Correlation         486%    Compl = 1.0 and Exp = 1
                     mRMR                162%    Compl = 0.5 and Exp = 1
Neural Network       All Features        0%      l = 0.2, m = 0.1 and time = 1000
                     Correlation         0%      l = 0.2, m = 0.1 and time = 2000
                     mRMR                0%      l = 0.2, m = 0.2 and time = 2000
Decision Tree C4.5   All Features        216%    Conf = 0.05
                     Correlation         378%    Conf = 0.15
                     mRMR                270%    Conf = 0.35
kNN                  All Features        54%     k = 10
                     Correlation         650%    k = 3
                     mRMR                216%    k = 8
For the 2-years dataset the oversampling needed to balance the data is approximately 100%.
By analysing Table 4.5 we can see that, in most cases, when oversampling is applied, the
percentage used inverts the balance of classes. In those cases the oversampling helps to define the
decision boundaries. The neural network does not use any oversampling; this is consistent with
the majority of the results, which show that this algorithm is not sensitive to oversampling effects.
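The SMOTE percentages in Table 4.5 (e.g., 216% for Naïve Bayes with all features) denote the amount of synthetic minority data generated. The following hand-rolled, one-dimensional sketch mimics the idea of SMOTE [10], interpolating between neighbouring minority instances; it is not the implementation used in this work.

```python
import random

def smote_like(minority, percent, seed=0):
    """Generate percent% synthetic samples (relative to the minority size)
    by interpolating between neighbouring minority values (1-D sketch)."""
    rng = random.Random(seed)
    pts = sorted(minority)
    n_new = int(len(pts) * percent / 100)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(pts) - 1)
        a, b = pts[i], pts[i + 1]                     # neighbouring pair
        synthetic.append(a + rng.random() * (b - a))  # point on the segment
    return synthetic

evol = [1.0, 1.2, 1.5, 2.0]          # minority (Evol) scores, invented
new = smote_like(evol, percent=216)  # 216% oversampling
print(len(new))                      # 8 synthetic samples (216% of 4)
```

With a large enough percentage, the minority class overtakes the majority, which is exactly the balance inversion observed in Table 4.5.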
4. Predicting conversion from MCI to AD (Prognosis)

Statistically, we can compare the models using 30 repetitions. For this analysis we use paired
t-tests in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The two best models are Naïve Bayes and SVM RBF. Both have a significant difference to
the other models, with a higher mean. The Decision Tree model got the worst results,
having in all cases a significantly worse mean, except against the kNN, where no
statistically significant difference was found.
• Correlation
The Naïve Bayes model shows a significant difference to all other models and in all cases
a higher mean result. The Decision Tree model shows the worst result, having in all
cases a significant difference with a lower mean result.
• mRMR
The SVM RBF has a significant difference in all cases, with a higher mean result. The
Decision Tree is the worst model; it only achieved a significant difference with a higher mean
result against the Neural Networks.
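The all-vs-all comparison over the 30 repetitions can be sketched as follows. Only the paired t statistic is computed here; in practice the ANOVA gate and the p-values come from a statistics package, and the per-repetition scores below are invented.

```python
import math
import statistics

def paired_t(results_a, results_b):
    """Paired t statistic for two models evaluated on the same splits.
    |t| is then compared with the critical value for n-1 degrees of
    freedom at the chosen confidence level (95% here)."""
    d = [a - b for a, b in zip(results_a, results_b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Invented per-repetition |TPR - FPR| scores on identical folds
nb = [0.42, 0.40, 0.45, 0.41, 0.44, 0.43]   # e.g. a strong model
dt = [0.15, 0.12, 0.20, 0.14, 0.18, 0.13]   # e.g. a weak model
print(paired_t(nb, dt) > 2.571)  # exceeds the df=5 critical value at 95%
```

Because the repetitions share the same train/test splits, the paired version of the test is the appropriate one: it compares the per-split differences rather than the two raw score distributions.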
[Figure: six panels (Naïve Bayes, SVM RBF, SVM Poly, Decision Tree C4.5, Neural Network, kNN), each plotting |TPR − FPR| on a 0 to 1 scale.]
Figure 4.6: Test results of Prognosis using two years temporal window
Analysing the results on the test set, presented in Figure 4.6, we can observe that three models
achieve good results in all datasets. Those models are Naïve Bayes and the SVMs (with a linear and
Table 4.6: Best classifiers for the prognosis problem with a 2 years temporal window, using the test set.

Classifier    FS            Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
Naïve Bayes   Correlation   75%        78%           64%           0.79       0.42
SVM RBF       Original      76%        80%           64%           0.72       0.44
SVM Poly      mRMR          73%        69%           86%           0.78       0.55
NN            Original      79%        84%           64%           0.84       0.48
DT C4.5       Correlation   60%        63%           50%           0.62       0.13
kNN           Correlation   60%        51%           93%           0.74       0.44
RBF kernel). The Neural Network, with the original dataset, has a result of almost |TPR − FPR| ≈
0.5. However, these results are not achieved with the reduced feature sets, which means that the
reduced feature sets are not particularly well fitted to the Neural Networks. Another
interesting result is the kNN using the correlated dataset, where we obtained |TPR − FPR| ≈
0.45; with the same algorithm but using the mRMR dataset, the result is almost equal to a
random classifier. In this example we can clearly see the influence of the feature selection.
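The effect of a correlation-based reduced set can be pictured with a simple filter that keeps only the features sufficiently correlated with the class. This is an illustrative stand-in (feature names and data are made up; the thesis uses WEKA's selection methods):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def correlation_filter(features, labels, threshold=0.3):
    """Keep the features whose absolute correlation with the class label
    exceeds the threshold (a toy stand-in for correlation-based selection)."""
    return [name for name, values in features.items()
            if abs(pearson(values, labels)) > threshold]

# Toy data: 'mmse' tracks the label, 'noise' does not (values invented)
labels = [0, 0, 0, 1, 1, 1]
features = {"mmse": [28, 27, 29, 20, 21, 19], "noise": [5, 9, 4, 8, 9, 4]}
print(correlation_filter(features, labels))  # ['mmse']
```

A filter like this explains why a feature set that suits one algorithm (kNN with the correlated set) can fail another: the selection criterion is independent of the classifier that consumes the features.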
Analysing these results, we conclude that the three models with the highest generalization are:
the SVM polynomial using the mRMR features, the Neural Networks using the Original dataset, and the kNN
using the correlated dataset. Each of these models uses a different feature set, which allows us
to conclude that the features influence the algorithms differently. The best model is the SVM
polynomial (linear) using the mRMR feature selection. Based on the other metrics (see Table
4.6), it can be observed that the Neural Networks have the best ROC area, the best accuracy and
the highest sensitivity; however, their discrimination of the evolution class is lower.
Table 4.7: Dataset demographics after applying the pre-processing for the three years temporal window.

Groups (3 Years)              noEvol        Evol
Size                          122 (50%)     120 (50%)
Age (Mean ± SD)               68.7 ± 7.9    72.5 ± 8.3
Sex (Male/Female)             50/72         39/81
Schooling years (Mean ± SD)   9.0 ± 5.2     8.6 ± 4.8
Three Years Temporal Window
In the three years dataset the data balance is not a problem, since the two classes have the
same representation. As we can see, the oversampling is in most cases null or very
small. As we saw before, oversampling has the capability of overpopulating the dataset
to better define the decision frontier; this side effect is the main reason why oversampling is
sometimes used. Nevertheless, there are cases where the oversampling percentage is higher,
such as the Naïve Bayes. We can see that in this problem the Naïve Bayes got
Table 4.8: Classification model parameters for the prognosis in a temporal window of 3 years. The number of features selected is 153 for All Features, 21 for Correlation and 17 for mRMR. In this set the missing values are 43% for All Features, 12% for Correlation and 42% for mRMR.

Classifier           Feature Selection   SMOTE   Parameters
Naïve Bayes          All Features        266%    Supervised discretization
                     Correlation         266%    Supervised discretization
                     mRMR                0%      Gaussian
SVM RBF              All Features        0%      Compl = 1.0 and γ = 0.1
                     Correlation         76%     Compl = 1.5 and γ = 0.1
                     mRMR                0%      Compl = 0.5 and γ = 1.0
SVM Poly             All Features        38%     Compl = 1.5 and Exp = 1
                     Correlation         0%      Compl = 4.0 and Exp = 2
                     mRMR                0%      Compl = 1.5 and Exp = 2
Neural Network       All Features        0%      l = 0.3, m = 0.1 and time = 1000
                     Correlation         0%      l = 0.2, m = 0.1 and time = 2000
                     mRMR                0%      l = 0.2, m = 0.2 and time = 2000
Decision Tree C4.5   All Features        38%     Conf = 0.45
                     Correlation         152%    Conf = 0.45
                     mRMR                228%    Conf = 0.5
kNN                  All Features        0%      k = 5
                     Correlation         0%      k = 5
                     mRMR                0%      k = 9
some confusion on the decision borders, which is minimized by applying oversampling to the
noEvolution class. The decision trees have the same problem in all feature sets.
Statistically, we can compare the models using 30 repetitions. For this analysis we use paired
t-tests in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The Naïve Bayes and SVM RBF do not have a statistically significant difference between them, but
both differ significantly from all other models, always with a higher mean result.
The Decision Tree has in all cases a significantly lower mean result, except against the
Neural Network, where no significant difference was found.
• Correlation
The Naïve Bayes has a significant difference from all other models, with a greater mean
result. The Decision Tree has in all cases a significantly worse mean result.
• mRMR
The SVM RBF has a significant difference from all other models, always with a higher
mean result. The Decision Tree has in all cases a significant difference and a worse mean
result.
[Figure: six panels (Naïve Bayes, SVM RBF, SVM Poly, Decision Tree C4.5, Neural Network, kNN), each plotting |TPR − FPR| on a 0 to 1 scale.]
Figure 4.7: Test results of Prognosis using three years temporal window
Table 4.9: Best classifiers for the prognosis problem with a 3 years temporal window, using the test set.

Classifier    FS            Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
Naïve Bayes   mRMR          84%        94%           70%           0.86       0.64
SVM RBF       mRMR          82%        79%           87%           0.83       0.66
SVM Poly      Original      73%        76%           70%           0.73       0.45
NN            Original      71%        73%           70%           0.79       0.42
DT C4.5       Original      73%        82%           61%           0.72       0.43
kNN           Correlation   77%        76%           78%           0.78       0.54
In this case the dataset is not unbalanced, but oversampling still helped to delimit the decision
boundaries. In the three years time window, the test results show that all classifiers have a reasonable
behaviour. The best result obtained in this time window is |TPR − FPR| ≈ 0.65, a result that is
closer to the perfect classifier than to the random one. We obtained this result with two models:
the Naïve Bayes and the SVM RBF, both using the mRMR dataset. The Neural Network also has
good results in all datasets, with |TPR − FPR| ≈ 0.5 in all cases, although the best results are
achieved using the original set (all features). The kNN also has a good overall behaviour, reaching
|TPR − FPR| ≈ 0.55 with the correlated dataset. The worst results are achieved by
the decision trees; this can be explained by the high confidence obtained on the training set,
which caused the model to overfit.
It should be noticed that the best results were obtained using no oversampling and the
mRMR feature set. As suspected, oversampling has little influence when a balanced dataset
is used; nevertheless, in some cases it is beneficial, as is the case of the Naïve Bayes. The
best model is the radial SVM (SVM RBF) using the mRMR feature selection.

Using different metrics (see Table 4.9), we can observe that the Naïve Bayes achieves the highest
accuracy, sensitivity and ROC area. However, its specificity is significantly lower than that
of the radial SVM.
Table 4.10: Dataset demographics after applying the pre-processing for the four years temporal window.

Groups (4 Years)              noEvol        Evol
Size                          74 (35%)      140 (65%)
Age (Mean ± SD)               68.4 ± 8.1    71.8 ± 8.3
Sex (Male/Female)             33/41         46/94
Schooling years (Mean ± SD)   9.0 ± 5.1     8.7 ± 4.8
Table 4.11: Classification model parameters for the prognosis in a temporal window of 4 years. The number of features selected is 153 for All Features, 17 for Correlation and 19 for mRMR. In this set the missing values are 44% for All Features, 14% for Correlation and 43% for mRMR.

Classifier           Feature Selection   SMOTE   Parameters
Naïve Bayes          All Features        203%    Supervised discretization
                     Correlation         290%    Supervised discretization
                     mRMR                29%     Gaussian
SVM RBF              All Features        203%    Compl = 3.0 and γ = 0.01
                     Correlation         58%     Compl = 3.5 and γ = 0.1
                     mRMR                87%     Compl = 1.5 and γ = 0.1
SVM Poly             All Features        290%    Compl = 1.0 and Exp = 1
                     Correlation         87%     Compl = 1.5 and Exp = 1
                     mRMR                29%     Compl = 1.0 and Exp = 1
Neural Network       All Features        0%      l = 0.2, m = 0.2 and time = 1000
                     Correlation         0%      l = 0.3, m = 0.1 and time = 1000
                     mRMR                0%      l = 0.3, m = 0.1 and time = 1000
Decision Tree C4.5   All Features        87%     Conf = 0.05
                     Correlation         145%    Conf = 0.35
                     mRMR                174%    Conf = 0.2
kNN                  All Features        0%      k = 8
                     Correlation         0%      k = 5
                     mRMR                29%     k = 10
Four Years Temporal Window
In the four years time window the balance of the data is inverted with respect to the two years time
window: the class distribution is now 65% Evolution and 35% noEvolution. Table 4.11
shows the parameters after the grid search. It can be observed that oversampling is used in
almost all models, except in the Neural Networks; in the kNN, oversampling was only used with
the mRMR feature set, and its value in that case is very low. Note that the SMOTE level used
in the Naïve Bayes and the SVMs inverts the data balance; in this case the SMOTE
helps the definition of the borders, as in the previously studied cases.
Statistically, we can compare the models using 30 repetitions. As before, we use paired
t-tests in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The Naïve Bayes and SVM RBF do not have a significant difference between them, but both
differ significantly from all other models, always with a higher mean result. The
Decision Tree has in all cases a significantly worse mean result.
• Correlation
The Naïve Bayes, SVM RBF and kNN do not have a significant difference between them
and have the better mean results. The Decision Tree has in all cases a significantly worse mean
result.
• mRMR
The Naïve Bayes and SVM RBF do not have a significant difference between themselves,
but both differ significantly from all other models, with a higher mean result. The
Decision Tree and the Neural Network have no significant difference between them, but in all
other cases they show a significant difference with a worse mean result.
Table 4.12: Best classifiers for the prognosis problem with a 4 years temporal window, using the test set.

Classifier    FS            Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
Naïve Bayes   mRMR          74%        89%           64%           0.84       0.52
SVM RBF       mRMR          81%        83%           80%           0.82       0.63
SVM Poly      Original      77%        78%           76%           0.77       0.54
NN            mRMR          77%        72%           80%           0.88       0.52
DT C4.5       Original      74%        61%           84%           0.71       0.45
kNN           Correlation   79%        72%           84%           0.76       0.56
Using the four years temporal window we also obtained an overall good performance on the test
set. The best model found is the SVM RBF using the mRMR feature set; for this problem, the
mRMR set is the best in 3 of the 6 algorithms used. The best results are obtained using the SVM
RBF with the mRMR feature selection and the kNN with the correlated dataset. These results
correspond to a classifier closer to the perfect classifier than to the random one. Again we found that the
[Figure: six panels (Naïve Bayes, SVM RBF, SVM Poly, Decision Tree C4.5, Neural Network, kNN), each plotting |TPR − FPR| on a 0 to 1 scale.]
Figure 4.8: Test results of Prognosis using four years temporal window
classification performance depends on the features and on the classification algorithm. The worst
result is obtained when we use the decision trees with the mRMR feature set, which shows that,
with this feature set, the Decision Tree does not have a good generalization power. The best model in
this approach is the Gaussian SVM (SVM RBF) using the mRMR feature selection. By analysing
the other metrics (see Table 4.12) we conclude that this SVM has a well balanced result, with 83%
sensitivity, 80% specificity and a 0.82 ROC area.
4.4 Summary
In this chapter, we studied two approaches to process the data in order to predict conversion
from MCI to AD (prognosis), using state of the art methods. The standard method, named in this
work First and Last Evaluation, was tested to define a baseline in discriminative
power. The temporal windows approach was presented as an alternative to this method and as a
way to take into account the different profiles that a patient can have in their evaluation
history. With the temporal windows approach, we obtained a higher discriminative power on the test
set (pre-processed for each problem). The results show that using temporal windows increases
the prediction capability of the models. In this work, we also compared our First
and Last Evaluation approach with the work of Maroco et al [36], achieving slightly better results;
with the temporal windows, the results have an even greater discriminative power.

Regarding the size of the temporal windows, we obtained the best results on the test set using the three years
temporal window, as we expected from the medical feedback.
Table 4.13: Best models for each progression approach.

Approach       Classifier    Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
First & Last   Naïve Bayes   71%        72%           68%           0.67       0.4
2 Years        SVM Poly      73%        69%           86%           0.78       0.55
3 Years        SVM RBF       82%        79%           87%           0.83       0.66
4 Years        SVM RBF       81%        83%           80%           0.82       0.63
5 Decision Support System
The models created in chapters 3 and 4 showed an overall good performance in the tasks of
discriminating MCI patients (diagnosis) and predicting the progression from MCI to AD (prognosis).
However, these models are not directly usable by the medical doctors. To bridge this usability gap,
a solution was designed and implemented in this thesis to facilitate the use of the system by
third parties. By integrating the models into an information system, the medical doctors can now
evaluate them in a real work situation. Since this work was done in the context of the NEURO-
CLINOMICS project, the integration of the models into an application that can be used by the
healthcare professionals is of huge importance. With this in mind, a modular Decision Support
System (DSS) web services architecture was developed, which integrates with other tools devel-
oped in the project, in particular with the AD information system that is under development. The
use of web services allows model updating without altering any other part of the system.

The DSS was designed in a modular way and is composed of the following components: (i)
a data input system, where healthcare professionals introduce data relative to the instances or
update previous ones; (ii) a prediction system that computes the patient diagnosis and prognosis (2, 3 and 4 years)
based on previously trained models; (iii) an automatic training tool that updates the model parameters
(feature set selection, oversampling percentage and the algorithm parameters) based on the
complete known instances.
5.1 System
Using WEKA, six models have been created for each problem: diagnosis, and prognosis in
a temporal window of two, three and four years. These models are parametrized using the tool
created and explained previously. The user has the possibility of choosing any of the models,
or even all of them. For example, the user can choose, from the 6 models of a specific problem, those
with more confidence in their results, and present a confidence interval to the final user.
The user can also test a patient evaluation: the system returns the diagnosis prediction and,
if this diagnosis is MCI, the system then shows the prognosis results. In this way the system
integrates the diagnosis and prognosis in a simple way. The constant update of the database will
Figure 5.1: DSS web service architecture
decrease the missing data, since in the new evaluations the clinicians are now using a higher number
of assessment tests (features). This change will likely lead the system to choose new feature sets,
since, for example, features with fewer missing values are likely more relevant to the classification.
Database updates can also change the class balance. The models must therefore be updated and
re-parametrized taking these factors into account. For this, the parametrization system described
in chapter 3 is used to acquire the new parameters and then create new models that reflect the
changes.
The implementation of this DSS was performed using web services. This technology allows
us to deploy the services on an application server in the network; the services can then be remotely
accessed using a defined message scheme, e.g., XML. The DSS also allows easy integration
of new services and easy updating of existing ones.
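The concrete message scheme is not fixed here. Purely as an illustration, a request and a response could be serialized as XML like in the sketch below; the element, attribute and model names are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_request(model, evaluation):
    """Serialize a classification request for one of the web services.
    'evaluation' maps neuropsychological test names to scores; missing
    tests are simply omitted, mirroring the missing values in the data."""
    root = ET.Element("classificationRequest", model=model)
    for test, score in evaluation.items():
        ET.SubElement(root, "feature", name=test).text = str(score)
    return ET.tostring(root, encoding="unicode")

def parse_response(xml_text):
    """Extract the prediction confidence returned by the service."""
    return float(ET.fromstring(xml_text).findtext("confidence"))

req = build_request("NaiveBayes", {"mmse": 27, "adas_cog": 12})
print(req)
print(parse_response("<classificationResponse>"
                     "<confidence>0.92</confidence>"
                     "</classificationResponse>"))
```

A schema-based message like this is what makes the services replaceable: a retrained model can be deployed behind the same endpoint without changing any client.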
For this work four web services have been created: one for the diagnosis and one for each
prognosis temporal window. A web service receives as arguments the patient evaluation and
the model to be used, and returns the confidence of the prediction. The possible models to be
Figure 5.2: Prototype data input screen.
used are: Naïve Bayes, SVM (Gaussian or polynomial), kNN, C4.5 decision trees and artificial
neural networks. Figure 5.1 presents the architecture of the proposed system. The client
sends a request with the patient evaluation over the network to the web service, which interrogates
the classification models; in the end, a response is returned to the client.
Figure 5.3: Prototype output screen.
Using these web services we can construct applications that allow a client to easily access
the information created by the models. Figure 5.2 shows a prototype data-insertion screen.
In this prototype there are two boxes: one to select the classification model (Neural Networks,
SVMs, Naïve Bayes, kNN and C4.5 Decision Tree) and another, called "Patient Evaluation", where
the patient evaluation is inserted. After clicking "Submit", the classification request is sent to
the web service. When the response to the request arrives, the screen shown in Figure 5.3
appears.
A system that uses the created web services already exists; it was created by a NEURO-
CLINOMICS team member in the scope of an AD information system. In Figure 5.4 a screenshot
of the system is shown. In the table of Figure 5.4, the predictions of all models are displayed;
for the diagnosis model the values correspond to the probability of the patient being MCI, and for
the prognosis, to the probability of the patient not evolving to AD. The graphic shows the maximum,
average and minimum for each problem (diagnosis and the three prognosis temporal windows); these
confidence intervals are created by applying multiple models to the same instance.

In Figure 5.4 a real instance is used: the user inquires the system about an instance
of patient 9, evaluated on 22/11/2004. The response indicates that the patient is MCI with a
model-average probability of 95%, and that the probability of not converting to AD within 2, 3 or
4 years is high (a model average of 90%, 92% and 90%, respectively). This means that the probability
of evolution to AD within 4 years is very low.
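The maximum/average/minimum aggregation behind the graphic can be sketched as follows; the per-model probabilities are invented for illustration.

```python
def confidence_interval(predictions):
    """Aggregate per-model probabilities for one instance into the
    minimum / average / maximum shown in the DSS graphic."""
    probs = list(predictions.values())
    return min(probs), sum(probs) / len(probs), max(probs)

# Hypothetical per-model probabilities of 'no evolution' for one window
three_year = {"NaiveBayes": 0.95, "SVM_RBF": 0.93, "SVM_Poly": 0.90,
              "NeuralNetwork": 0.92, "C4.5": 0.88, "kNN": 0.94}
lo, avg, hi = confidence_interval(three_year)
print(lo, round(avg, 2), hi)  # 0.88 0.92 0.95
```

A narrow min-max band means the six models agree, so the clinician can place more trust in the average shown.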
Figure 5.4: DSS system screen. The values for diagnosis represent the probability of the patient being MCI; for prognosis, the values represent the probability of not evolving to AD.
5.2 Summary
In this chapter a decision support system, built on a web services architecture, was briefly
described. The DSS was created to facilitate the use of the system by the medical doctors and
to integrate the work done in this thesis with the NEUROCLINOMICS project. A
prototype was described and a functional application was shown. Since the created system is highly
modular, the classification models are easy to upgrade, and the upgrade can be done automatically.
6 Conclusions and Future Work
In this work we studied the diagnosis and prognosis prediction for patients suffering from Alzheimer's
disease. To perform this work we used a dataset consisting of numeric neuropsychological tests and
the corresponding diagnosis. Because new neuropsychological tests were added to the dataset
over time, it suffers from missing values. Furthermore, since there are many more MCI in-
stances in the dataset than AD instances, it also suffers from class imbalance. Thus, this work
deals with the combined influence of class imbalance, high dimensionality and missing values. To evaluate
the influence of all these factors, a unified approach was designed to create and evaluate the
models. This approach uses a grid search that combines oversampling with a multidimensional
parameter search.
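Conceptually, this unified search enumerates the Cartesian product of an oversampling axis and the algorithm's hyper-parameter axes, scoring each candidate by cross-validated |TPR − FPR|. The sketch below only shows the enumeration; the search-space values are illustrative, not the ones used in the experiments.

```python
from itertools import product

def grid(axes):
    """Enumerate every combination of SMOTE percentage and algorithm
    hyper-parameters; each combination would then be scored by
    cross-validated |TPR - FPR|."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*axes.values())]

# Hypothetical search space for an RBF SVM
space = {"smote": [0, 54, 108, 162, 216],
         "C": [0.5, 1.0, 2.0, 5.0],
         "gamma": [0.01, 0.1, 1.0]}
configs = grid(space)
print(len(configs))  # 5 * 4 * 3 = 60 candidate configurations
```

Treating the oversampling percentage as just another axis of the grid is what makes the parametrization uniform across the six algorithms.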
To evaluate the results without bias in the case of an unbalanced dataset, the |TPR − FPR|
metric is used for all models in this work. This metric is a trade-off between sensitivity and
specificity.
In this work we tackled the diagnosis problem and created models that discriminate MCI and
AD cases. We analysed the behaviour of a set of supervised data mining algorithms and con-
cluded that Naïve Bayes and Neural Networks have a better performance when confronted with
an unknown test set. Those results are obtained using the Original set of features (All Features)
and the Correlated set. We can also conclude that 10-fold cross-validation provides an
estimate of the goodness of the result that does not match the one obtained on the test set. This can be
caused by overfitting to the training set, which leads to a low generalization of the models; simi-
lar results are obtained in all the analysed problems. One of the best diagnostic models was obtained
using the Naïve Bayes algorithm, with an accuracy of 91%, a sensitivity of 93%, a specificity of 69%,
a ROC Area of 0.85 and a |TPR − FPR| of 0.62.
For the prognosis problem we presented a new approach to predicting the conversion from MCI
to AD. The standard method is to use the first and last evaluation of the patient. In our opinion,
this approach discards important information, such as the profiles that a patient can have in their
evaluation history: over those 10 years of follow-up, a patient can pass through profiles that other
patients can also have. By using a temporal window approach, we obtained better discriminative results. We
concluded that the best models are the Naïve Bayes and SVM algorithms, and that the
mRMR feature set showed very good results, generally better than those obtained with the original
set or the correlated set. The temporal window with the highest discriminative power is the 3
years window. Using this window, the best model was obtained with the radial SVM algorithm, which
has an accuracy of 82%, a sensitivity of 79%, a specificity of 86%, a ROC Area of 0.83 and a
|TPR − FPR| of 0.64.
Finally, we created a Decision Support System that uses the diagnosis and prognosis models.
This system can help the medical doctors to evaluate patients in a short space of time. The
system was implemented using web services in order to integrate this work into the NEUROCLINOMICS
project. The use of web services allows this work to be integrated with other works in the scope of
the project, since it uses a simple communication protocol.
Future work
To improve the quality of the decision support system, including the prediction models, new
approaches and techniques should be investigated. This includes the use of state-of-the-art super-
vised or unsupervised data discretization and feature selection techniques, as well as new classification
models, including those using boosting approaches. Further-
more, in order to identify patient profiles, unsupervised clustering techniques can be applied. This
would allow the development of specialized models for specific groups of patients and therefore
improve the prediction accuracy. It should be noted that, while some of these techniques were
applied in the course of this thesis, no significant results were achieved. Nonetheless, we believe
that a careful application of these techniques should be able to identify groups of patients.

Finally, in order to improve the decision support system information, techniques should be
studied to tackle the time-to-conversion problem, where one predicts how much time will pass
until the patient converts from MCI to AD.
Bibliography
[1] Abdi, H. and Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary
Reviews: Computational Statistics.
[2] Bekris, L. M., Yu, C.-E., Bird, T. D., and Tsuang, D. W. (2010). Review article: Genetics of
alzheimer disease. Journal of Geriatric Psychiatry and Neurology.
[3] Boonchuay, K., Sinapiromsaran, K., and Lursinsap, C. (2011). Minority split and gain ratio for
a class imbalance.
[4] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
[5] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
regression trees.
[6] Garcia, C. (1984). Doença de Alzheimer, problemas do diagnóstico clínico. PhD thesis, Faculdade
de Medicina de Lisboa.
[7] Chapman, R., Mapstone, M., McCrary, J., Gardner, M., Porsteinsson, A., Sandoval, T., Guillily,
M., DeGrush, E., and Reilly, L. (2011a). Predicting conversion from mild cognitive impairment
to alzheimer’s disease using neuropsychological tests and multivariate methods. Journal of
Clinical and Experimental Neuropsychology, 33(2):187–199.
[8] Chapman, R. M., Mapstone, M., McCrary, J. W., Gardner, M. N., Porsteinsson, A., Sandoval,
T. C., Guillily, M. D., DeGrush, E., and Reilly, L. A. (2011b). Predicting conversion from mild
cognitive impairment to alzheimer’s disease using neuropsychological tests and multivariate
methods. Journal of Clinical and Experimental Neuropsychology.
[9] Chawla, N. V. (2010). Data Mining for Imbalanced Datasets: An Overview, pages 875–886.
Number 40. Springer US.
[10] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic
Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16.
[D’Agostino et al.] D’Agostino, R. B., Lee, M.-L., Belanger, A. J., Cupples, L. A., Anderson, K.,
and Kannel, W. B. Relation of pooled logistic regression to time dependent cox regression
analysis: The framingham heart study.
[12] Data, M. (2007). Missing data in clinical. Group, pages 453–460.
[13] de Lemos, L. J. M., Silva, D., Guerreiro, M., Mendonca, A., Tomas, P., and Madeira, S.
(2012a). Discriminating alzheimer's disease from mild cognitive impairment using neuropsy-
chological data.
[14] de Lemos, L. J. M., Silva, D., Guerreiro, M., Mendonca, A., Tomas, P., and Madeira, S.
(2012b). Predicting conversion from mild cognitive impairment to alzheimer’s disease using
neuropsychological data: Preliminary results.
[15] Dietterich, T. G. and Bakiri, G. (1995). Solving multiclass learning problems via error-
correcting output codes. Journal of Artificial Intelligence Research 2.
[16] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass svm method? an empirical
study. Springer-Verlag Berlin Heidelberg.
[17] Elkan, C. (2001). The foundations of cost-sensitive learning. International Joint Conference
on Artificial Intelligence, 17(1):973–978.
[18] Ewers, M., Walsh, C., Trojanowski, J., Shaw, L., Petersen, R., Jack Jr, C., Feldman, H.,
Bokde, A., Alexander, G., Scheltens, P., et al. (2010a). Prediction of conversion from mild
cognitive impairment to alzheimer’s disease dementia based upon biomarkers and neuropsy-
chological test performance. Neurobiology of Aging.
[19] Ewers, M., Walsh, C., Trojanowski, J. Q., Shaw, L. M., Petersen, R. C., Jr., C. R. J., Feldman,
H. H., Bokde, A. L., Alexander, G. E., Scheltens, P., Vellas, B., Duboisl, B., Weiner, M., and
Hampel, H. (2010b). Prediction of conversion from mild cognitive impairment to alzheimer’s
disease dementia based upon biomarkers and neuropsychological test performance. Elsevier.
[20] Guerreiro, M., Silva, A. P., Botelho, M. A., Leitão, O., Castro-Caldas, A., and Garcia, C. (1994). Adaptação
a populacao portuguesa da traducao do mini mental state examination (mmse). Revista
Portuguesa de Neurologia.
[21] Hall, M. (1999). Correlation-based feature selection for machine learning. PhD thesis, The
University of Waikato.
[22] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The
WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18.
[23] Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition.
[24] He, H. and Garcia, E. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.
Bibliography
[25] Hinrichs, C., Singh, V., Xu, G., and Johnson, S. (2011). Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population. NeuroImage, 55(2):574–589.
[26] Hinrichs, C., Singh, V., Xu, G., and Johnson, S. C. (2010). Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population. NeuroImage.
[27] Jack Jr, C., Wiste, H., Vemuri, P., Weigand, S., Senjem, M., Zeng, G., Bernstein, M., Gunter,
J., Pankratz, V., Aisen, P., et al. (2010). Brain beta-amyloid measures and magnetic resonance
imaging atrophy both predict time-to-progression from mild cognitive impairment to alzheimer’s
disease. Brain, 133(11):3336–3348.
[28] John, G. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers.
In Proceedings of the eleventh conference on uncertainty in artificial intelligence, pages 338–
345. Morgan Kaufmann Publishers Inc.
[29] Kass, G. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2):119–127.
[30] Keerthi, S., Shevade, S., Bhattacharyya, C., and Murthy, K. (2001). Improvements to Platt’s
SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649.
[31] Kolibas, E., Korinkova, V., Novotny, V., Vajdickova, K., and Hunakova, D. (2000). ADAS-Cog (Alzheimer's Disease Assessment Scale-Cognitive subscale): validation of the Slovak version. Bratislavske Lekarske Listy.
[32] Liu, H. and Setiono, R. (1996). A probabilistic approach to feature selection: a filter solution. In Proceedings of the 13th International Conference on Machine Learning, pages 319–327. Morgan Kaufmann.
[33] Loewenstein, D., Greig, M., Schinka, J., Barker, W., Shen, Q., Potter, E., Raj, A., Brooks, L., Varon, D., Schoenberg, M., et al. (2012). An investigation of preMCI: Subtypes and longitudinal outcomes. Alzheimer's & Dementia, 8(3):172–179.
[34] Loh, W. and Shih, Y. (1997). Split selection methods for classification trees. Statistica Sinica, 7:815–840.
[35] Maroco, J., Silva, D., Guerreiro, M., de Mendonca, A., and Santana, I. (2011a). Prediction of dementia patients: A comparative approach using parametric vs. non-parametric classifiers. XIX Congresso Anual da Sociedade Portuguesa de Estatística.
[36] Maroco, J., Silva, D., Rodrigues, A., Guerreiro, M., Santana, I., and de Mendonca, A.
(2011b). Data mining methods in the prediction of dementia: A real-data comparison of the
accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural
networks, support vector machines, classification trees and random forests. BMC research
notes, 4(1):299.
[37] Maroco, J., Silva, D., Rodrigues, A., Guerreiro, M., Santana, I., and de Mendonca, A.
(2011c). Data mining methods in the prediction of dementia: A real-data comparison of the
accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural
networks, support vector machines, classification trees and random forests. BMC research
notes, 4(1):299.
[38] Folstein, M. F., Folstein, S. E., and McHugh, P. R. (1975). "Mini-mental state": A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3):189–198.
[39] Mika, S., Ratsch, G., Weston, J., Scholkopf, B., and Mullers, K. (1999). Fisher discriminant
analysis with kernels. In Proceedings of the 1999 IEEE Signal Processing Society Workshop
on Neural Networks for Signal Processing, pages 41–48. IEEE.
[40] Mufson, E., Binder, L., Counts, S., DeKosky, S., deToledo Morrell, L., Ginsberg, S.,
Ikonomovic, M., Perez, S., and Scheff, S. (2012). Mild cognitive impairment: pathology and
mechanisms. Acta Neuropathologica, 123(1):13–30.
[41] Neter, J., Kutner, M. H., Nachtsheim, C., and Wasserman, W. (1996). Applied Linear Regression Models. The McGraw-Hill Companies.
[42] Noorbakhsh, F., Overall, C. M., and Power, C. (2009). Deciphering complex mechanisms
in neurodegenerative diseases: the advent of systems biology. Trends in Neurosciences,
32(2):88–100.
[43] Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238.
[44] Platt, J. C., Cristianini, N., and Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12.
[45] Powers, D. M. W. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1):37–63.
[46] Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[47] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.
[48] Raileanu, L. E. and Stoffel, K. (2004). Theoretical comparison between the Gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):77–93.
[49] Robert, P., Ferris, S., Gauthier, S., Ihl, R., Winblad, B., and Tennigkeit, F. (2010). Review of
alzheimer’s disease scales: is there a need for a new multi-domain scale for therapy evaluation
in medical practice? Alzheimer’s Research & Therapy.
[50] Samtani, M., Farnum, M., Lobanov, V., Yang, E., Raghavan, N., DiBernardo, A., Narayan,
V., et al. (2011). An improved model for disease progression in patients from the alzheimer’s
disease neuroimaging initiative. The Journal of Clinical Pharmacology.
[51] Silva, D., Guerreiro, M., Maroco, J., Santana, I., Rodrigues, A., Bravo Marques, J., and de Mendonca, A. (2012). Comparison of four verbal memory tests for the diagnosis and predictive value of mild cognitive impairment. Dementia and Geriatric Cognitive Disorders Extra, 2(1):120–131.
[52] Silva, D., Santana, I., do Couto, F. S., Maroco, J., Guerreiro, M., and de Mendonca, A. (2008). Cognitive deficits in middle-aged and older adults with bipolar disorder and cognitive complaints: Comparison with mild cognitive impairment. International Journal of Geriatric Psychiatry.
Appendix A: Medical exams (in Portuguese)
Table A.1: Feature list, part 1
Feature | Type | Description
Case number for this database | Numeric | Rank of cases
Age | Numeric | Age at evaluation
DiagNPS | String | Diagnosis from psychologist
Diagnosis code | Numeric | Neuropsychological and clinical diagnosis
Disease duration | Numeric | Evolution of cognitive symptoms in years
Date | Date | Date of the evaluation
School | Numeric | Years of formal education
Group | Numeric | Group in BLAD controls
Gender | Numeric |
Birth | Date | Date of birth
As cut | Numeric | Cancellation task (Corte de As), cuts (Min=0; Max (best score)=16)
As time | Numeric | Cancellation task (Corte de As), time (in seconds; lower is better)
As tot | Numeric | Cancellation task (Corte de As), total (cuts / time * 10)
DS forw | Numeric | Digit Span forward (Min=0; Max (best score)=9)
DS back | Numeric | Digit Span backward (Min=0; Max (best score)=8)
DS tot | Numeric | Digit Span total (forward + backward)
PA Easy | Numeric | Verbal Paired-Associate Learning, easy (Min=0; Max (best score)=18)
PA Dif | Numeric | Verbal Paired-Associate Learning, difficult (Min=0; Max (best score)=12)
PA Tot | Numeric | Verbal Paired-Associate Learning, total [(easy/2) + difficult]
PA Inter Easy | Numeric | Verbal Paired-Associate Learning (with interference), easy
PA Inter Dif | Numeric | Verbal Paired-Associate Learning (with interference), difficult
LM a | Numeric | Logical Memory A (Min=0; Max (best score)=23)
LM b | Numeric | Logical Memory B (Min=0; Max (best score)=22)
LM tot | Numeric | Logical Memory total (A + B/2)
LM a Cued | Numeric | Logical Memory A, cued (Min=0; Max (best score)=23)
LM b Cued | Numeric | Logical Memory B, cued (Min=0; Max (best score)=22)
LM a Interf | Numeric | Logical Memory (with interference) A (Min=0; Max (best score)=23)
LM b Interf | Numeric | Logical Memory (with interference) B (Min=0; Max (best score)=22)
LM tot Interf | Numeric | Logical Memory (with interference) total [A + B/2]
LM a Interf Cued | Numeric | Logical Memory (with interference) A, cued (Min=0; Max (best score)=23)
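Several entries in Table A.1 are composites of other subtests. As a minimal sketch, the derived scores could be computed from the raw values as below; the dictionary keys are hypothetical stand-ins for the feature names, and the formulas follow the table descriptions literally (in particular, LM tot is read as A + B/2, as written).

```python
def derived_scores(raw):
    """Compute the composite scores described in Table A.1 from raw subtest values.

    `raw` is a dict with hypothetical keys mirroring the feature names."""
    return {
        "As_tot": raw["As_cut"] / raw["As_time"] * 10,  # cancellation: cuts / time * 10
        "DS_tot": raw["DS_forw"] + raw["DS_back"],      # Digit Span: forward + backward
        "PA_Tot": raw["PA_Easy"] / 2 + raw["PA_Dif"],   # paired-associate: easy/2 + difficult
        "LM_tot": raw["LM_a"] + raw["LM_b"] / 2,        # Logical Memory: A + B/2, as written
    }
```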
Table A.2: Feature list, part 2
Feature | Type | Description
LM a Interf Cued | Numeric | Logical Memory (with interference) A, cued (Min=0; Max (best score)=23)
LM b Interf Cued | Numeric | Logical Memory (with interference) B, cued (Min=0; Max (best score)=22)
MVI Free | Numeric | Word Recall with Interference, free (Min=0; Max (best score)=15)
MVI Cued | Numeric | Word Recall with Interference, cued (Min=0; Max (best score)=10)
MVI Rec | Numeric | Word Recall with Interference, recognition (Min=0; Max (best score)=5)
MVI Tot | Numeric | Word Recall with Interference, total (free + cued + recognition)
Infor | Numeric | Test about general information (Min=0; Max (best score)=20)
VisualM A | Numeric | Visual Memory, image A (Min=0; Max (best score)=3)
VisualM B | Numeric | Visual Memory, image B (Min=0; Max (best score)=5)
VisualM C1 | Numeric | Visual Memory, image C1 (Min=0; Max (best score)=3)
VisualM C2 | Numeric | Visual Memory, image C2 (Min=0; Max (best score)=4)
VisualM total | Numeric | Visual Memory total (A + B + C1 + C2)
Or Total | Numeric | Orientation total (personal + spatial + temporal)
Orient P | Numeric | Orientation, personal (Min=0; Max (best score)=5)
Orient S | Numeric | Orientation, spatial (Min=0; Max (best score)=3)
Orient T | Numeric | Orientation, temporal (Min=0; Max (best score)=7)
Fluency Sem | Numeric | Verbal fluency (higher score is better)
Fluency Phon | Numeric | Phonologic fluency (higher score is better)
M Initiative | Numeric | Motor initiative (Min=0; Max (best score)=3)
Gm Initiative | Numeric | Graphomotor initiative (Min=0; Max (best score)=2)
Writing | Numeric | Writing (Min=0; Max (best score)=2)
Comp | Numeric | Orders comprehension (Min=0; Max (best score)=4)
Ident | Numeric | Objects identification (Min=0; Max (best score)=5)
Token T | Numeric | Token orders, total (Min=0; Max (best score)=17)
Naming | Numeric | Naming (Min=0; Max (best score)=7)
Repetition | Numeric | Repetition (Min=0; Max (best score)=11)
Token Complete | Numeric | Complete version of the Token test (Min=0; Max (best score)=22)
Snodgrass missing | Numeric | Snodgrass and Vanderwart, number of missing words
Snodgrass end | String | Snodgrass and Vanderwart, total number of presented words
Public Faces missing | Numeric | Public Faces, number of missing words
Public Faces end | String | Public Faces, total number of presented words
Prxs | Numeric | Motor coordination (Min=0; Max (best score)=12)
Cube | Numeric | Drawing of a cube (Min=0; Max (best score)=3)
Clock | Numeric | Drawing of a clock (Min=0; Max (best score)=3)
Table A.3: Feature list, part 3
Feature | Type | Description
Calc | Numeric | Calculation (Min=0; Max (best score)=14)
M Calc | Numeric | Mental calculation (Min=0; Max (best score)=11)
MPR | Numeric | Raven Progressive Matrices (Min=0; Max (best score)=12)
Proverb | Numeric | Verbal abstraction (Min=0; Max (best score)=9)
Stroop 1 | Numeric | Stroop, reading (max=100; higher score is better)
Stroop 2 | Numeric | Stroop, color naming (max=100; higher score is better)
Stroop 3 | Numeric | Stroop, word interference (max=100; higher score is better)
JLO | Numeric | Judgment of Line Orientation, correct answers (higher score is better)
Facial recognition | Numeric | Facial Recognition Test Record Form: Normal (41-54); Borderline (39-40); Moderately impaired (37-38); Severely impaired (<37)
WAIS cubos | Numeric | Wechsler Adult Intelligence Scale, cubes (0-52; higher score is better)
WAIS semelhancas | String | Wechsler Adult Intelligence Scale, similarities (0-26; higher score is better)
WAIS vocabulario | String | Wechsler Adult Intelligence Scale, vocabulary (0-80; higher score is better)
WAIS codigo | String | Wechsler Adult Intelligence Scale, symbol search (0-90; higher score is better)
WAIS lacunas | Numeric | Wechsler Adult Intelligence Scale, picture completion (0-21; higher score is better)
MMSE | Numeric | Mini-Mental State Examination (Min=0; Max (best score)=30)
TPRT | Numeric | Toulouse-Pieron (work output = number correct - (omissions + errors); higher score is better)
TPID | Numeric | Toulouse-Pieron (dispersion index = (omissions + errors)/correct * 100; lower score is better)
TMT A temp | Numeric | Trail Making Test (Part A), time (in seconds; the test is usually interrupted past 180 s; less time is better)
TMT A err | Numeric | Trail Making Test (Part A), errors (no maximum; fewer is better)
TMT B temp | Numeric | Trail Making Test (Part B), time (in seconds; the test is usually interrupted past 300 s; less time is better)
TMT B err | Numeric | Trail Making Test (Part B), errors (no maximum; fewer is better)
TMT B incomplete | Numeric |
a1 | Numeric | CVLT, List A, 1st recall (Min=0; Max (best score)=16)
a2 | Numeric | CVLT, List A, 2nd recall (Min=0; Max (best score)=16)
a3 | Numeric | CVLT, List A, 3rd recall (Min=0; Max (best score)=16)
a4 | Numeric | CVLT, List A, 4th recall (Min=0; Max (best score)=16)
a5 | Numeric | CVLT, List A, 5th recall (Min=0; Max (best score)=16)
Table A.4: Feature list, part 4
Feature | Type | Description
a1a5 | Numeric | CVLT, List A 1-5 total (sum of the 5 recalls; Min=0; Max (best score)=80)
a pers | Numeric | CVLT, List A perseverations (sum of word repetitions over the 5 recalls; no maximum; lower is better)
a intr | Numeric | CVLT, List A intrusions (sum of new words added to the list over the 5 recalls; no maximum; lower is better)
b tot | Numeric | CVLT, List B (Min=0; Max (best score)=16)
b pers | Numeric | CVLT, List B perseverations
b intr | Numeric | CVLT, List B intrusions
b cs | Numeric | CVLT, List B semantic clustering (CS; number of groupings of words from the same category)
a cr int | Numeric | CVLT, List A, free recall after a short delay (Min=0; Max (best score)=16)
a crint pers | Numeric | CVLT, short-delay free recall, perseverations
a crint intr | Numeric | CVLT, short-delay free recall, intrusions
a crint cs | Numeric | CVLT, short-delay free recall, CS
a crint ajsem | Numeric | CVLT, recall after a short delay with semantic cue (Min=0; Max (best score)=16)
a crint ajsem pers | Numeric | CVLT, short-delay cued recall, perseverations
a crint ajsem intr | Numeric | CVLT, short-delay cued recall, intrusions
a lg int | Numeric | CVLT, List A, recall after a long delay (Min=0; Max (best score)=16)
a lgint pers | Numeric | CVLT, long-delay free recall, perseverations
a lgint intr | Numeric | CVLT, long-delay free recall, intrusions
a lgint cs | Numeric | CVLT, long-delay free recall, CS
a lgint ajsem | Numeric | CVLT, recall after a long delay with semantic cue (Min=0; Max (best score)=16)
a lgint ajsem pers | Numeric | CVLT, long-delay cued recall, perseverations
a lgint ajsem intr | Numeric | CVLT, long-delay cued recall, intrusions
rec a | Numeric | CVLT, recognition after a long delay, List A (Min=0; Max (best score)=16)
rec Bp | Numeric | CVLT, recognition, List B shared (Min=0 (best score); Max=4)
rec Bn | Numeric | CVLT, recognition, List B not shared (Min=0 (best score); Max=4)
rec P | Numeric | CVLT, recognition, prototype (Min=0 (best score); Max=4)
rec sr | Numeric | CVLT, recognition, unrelated (Min=0 (best score); Max=16)
GDS | Numeric | Geriatric Depression Scale (Min=0 (best score); Max=15)
QSM Total | Numeric | Subjective Memory Complaints scale (Min=0 (best score); Max=22)
BlessedAVD | Numeric | Blessed, total of Part 1, daily living activities (Min=0 (best score); Max=8)
BlessedHAB | Numeric | Blessed, total of Part 2, habits (Min=0; Max=9)
BlessedPERS | Numeric | Blessed, total of Part 3, personality (Min=0; Max=11)
Table A.5: Feature list, part 5
Feature | Type | Description
BlessedTOT | Numeric | Blessed total (Part 1 daily living activities + Part 2 habits + Part 3 personality)
All of the following features are Numeric Z scores of the corresponding tests:
CancellationTask Z, DigitSpan Z, DigitSpan forward Z, DigitSpan backward Z, SemanticFluency Z, MotorInitiative Z, GraphomotorInitiative Z, Comprehension Z, Identification Z, Token Z, Naming Z, Repetition Z, Writing Z, Orientation Z, WordRecall Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, LogicalMemory Z, LogicalMemory A Z, LM DR Z, VisualMemory Z, Cube Z, Clock Z, CubesWAIS Z, Calculation Z, MPR Z, Proverbs Z, TP RT Z, TP ID Z, TMT A Z, TMT B Z, A1 Z, A5 Z, Atot Z, B Z, SDFR Z, SDCR Z, LDFR Z, LDCR Z, REC Z, Token Complete Z
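The Z-score features above are standardized versions of the raw test scores. Assuming they are computed against a normative mean and standard deviation for each test (the norms themselves are not given in this appendix), the transformation is the standard one, sketched below.

```python
def z_score(raw, norm_mean, norm_sd):
    """Standardize a raw test score against a normative mean and standard deviation.

    Positive values are above the norm. The normative values used in the thesis
    are not reproduced here; the arguments are placeholders."""
    if norm_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - norm_mean) / norm_sd
```

For example, a raw score of 25 against a norm of mean 27 and standard deviation 2 yields a Z score of -1.0.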
Appendix B: Diagnosis
Table B.1: Selected features for diagnosis, using correlation and mRMR.
Correlation: As cut, As tot, LM a Cued, LM a Interf, Infor, Or Total, Orient P, Orient S, Orient T, Fluency Sem, Writing, Naming, Cube, Clock, MPR, BlessedAVD, BlessedTOT, CancellationTask Z, DigitSpan Z, DigitSpan backward Z, SemanticFluency Z, GraphomotorInitiative Z, Orientation Z, WordRecall Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, LogicalMemory A Z, Clock Z, Calculation Z, MPR Z, Proverbs Z, Atot Z
mRMR: rec a, Proverb, Writing, PA Inter Dif, Naming, a crint ajsem pers, a crint pers, DS back, BlessedHAB, MPR, Comp, Orient S, M Calc, Orient T, a lgint cs, Gm Initiative, BlessedAVD, MVI Free, Writing Z, Clock, Naming Z, Orient P
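The mRMR column follows the minimum-redundancy maximum-relevance criterion of Peng et al. [43], which greedily picks features with high mutual information with the class and low mutual information with features already selected. A minimal sketch for discrete features is shown below; it is a toy stand-in for illustration, not the implementation used in this work.

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr(features, labels, k):
    """Greedy mRMR: score = MI(feature, class) - mean MI(feature, already selected)."""
    selected = []
    while len(selected) < k:
        best, best_score = None, -float("inf")
        for name, values in features.items():
            if name in selected:
                continue
            relevance = mutual_info(values, labels)
            redundancy = (sum(mutual_info(values, features[s]) for s in selected)
                          / len(selected)) if selected else 0.0
            if relevance - redundancy > best_score:
                best, best_score = name, relevance - redundancy
        if best is None:  # fewer than k features available
            break
        selected.append(best)
    return selected
```

On a toy dataset, a feature identical to the class labels is selected before one that is independent of them.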
Appendix C: Prognosis
C.1 First and Last Evaluation
[Figure: six bar-chart panels, one per classifier (KNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF, SVM Poly), each plotting |TPR-FPR| on a 0-1 scale.]
Figure C.1: Train results of Prognosis using First and Last Evaluations
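The quantity on the y-axis of these plots, |TPR - FPR|, is the magnitude of the gap between the true-positive rate (sensitivity) and the false-positive rate, i.e. the absolute value of the informedness/Youden's J statistic discussed by Powers [45]: 0 indicates chance-level behaviour and 1 a perfect separation. A minimal sketch of computing it from binary predictions:

```python
def tpr_fpr_gap(y_true, y_pred, positive=1):
    """Absolute difference between true-positive rate and false-positive rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # fall-out
    return abs(tpr - fpr)
```

For instance, a classifier with sensitivity 0.75 and false-positive rate 0.25 scores 0.5 on this metric.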
Table C.1: Selected features for the prognosis using the first and last evaluation. The techniques used are correlation [32] and mRMR [43].
Correlation: Age, As time, As tot, PA Dif, LM a, LM b, LM tot, LM a Interf, MVI Free, Or Total, Orient T, Cube, MPR, a1a5, a lg int, LogicalMemory Z, LogicalMemory A Z, Proverbs Z, TMT B Z, Atot Z
mRMR: VisualM B, Clock, As cut, Orient P, b cs, LM b Interf Cued, Gender, a lg int, TMT B incomplete, Gm Initiative, MVI Free, a lgint ajsem pers, Repetition, TMT B Z, Orient S, Naming, Cube, a lgint cs, Orient T, Writing, Calc, BlessedHAB, a lgint pers, PA Inter Dif, Comp, Token T, Prxs, Comprehension Z, TP RT Z
C.2 Temporal window: Two years
C.3 Temporal window: Three years
C.4 Temporal window: Four years
Table C.2: Selected features for the prognosis using the two-year temporal window. The techniques used are correlation [32] and mRMR [43].
Correlation: Age, PA Easy, PA Tot, LM a, LM a Interf, MVI Free, Or Total, Fluency Sem, Calc, CancellationTask Z, Orientation Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, Cube Z, MPR Z, Atot Z
mRMR: Writing, Orient P, DS back, TMT A err, a lgint ajsem, Naming, Or Total, Comp, MVI Free, Calc, a crint pers, Cube, TMT B incomplete, b cs, Orient S, PA Inter Dif, BlessedHAB, M Calc, TMT B temp
[Figure: six bar-chart panels, one per classifier (KNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF, SVM Poly), each plotting |TPR-FPR| on a 0-1 scale.]
Figure C.2: Train results of Prognosis using two years temporal window
Table C.3: Selected features for the prognosis using the three-year temporal window. The techniques used are correlation [32] and mRMR [43].
Correlation: Age, DS back, PA Easy, PA Tot, LM a Cued, LM a Interf, LM tot Interf, MVI Free, Orient T, Fluency Sem, Calc, a2, CancellationTask Z, DigitSpan Z, Orientation Z, WordRecall Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, LogicalMemory A Z, Cube Z, MPR Z
mRMR: DS back, a crint ajsem intr, Comp, a3, Naming, b cs, Or Total, Cube, MVI Free, a crint pers, M Calc, BlessedHAB, PA Inter Dif, a crint cs, TMT B incomplete, Writing, TPID
[Figure: six bar-chart panels, one per classifier (KNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF, SVM Poly), each plotting |TPR-FPR| on a 0-1 scale.]
Figure C.3: Train results of Prognosis using three years temporal window
Table C.4: Selected features for the prognosis using the four-year temporal window. The techniques used are correlation [32] and mRMR [43].
Correlation: PA Tot, LM a, LM a Cued, LM a Interf, LM tot Interf, MVI Free, Infor, Orient T, Fluency Sem, Cube, BlessedAVD, SemanticFluency Z, Orientation Z, WordRecall Z, VerbalPaired AssociateLearning Z, MPR Z, A5 Z
mRMR: Orient P, Fluency Phon, LDCR Z, Writing Z, Comp, b cs, LM b Interf Cued, MVI Free, TMT A err, Orient T, Naming, Cube, a crint cs, M Calc, a lgint pers, PA Inter Dif, TMT B incomplete, BlessedHAB, TPRT
[Figure: six bar-chart panels, one per classifier (KNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF, SVM Poly), each plotting |TPR-FPR| on a 0-1 scale.]
Figure C.4: Train results of Prognosis using four years temporal window