A data mining approach to predict conversion from mild cognitive impairment to Alzheimer’s Disease
Luís Jorge Matias de Lemos
Dissertation submitted to obtain the Master Degree in Information Systems and Computer Engineering
Jury
President: Doctor José Carlos Alves Pereira Monteiro
Supervisor: Doctor Sara Alexandra Cordeiro Madeira
Co-supervisor: Doctor Pedro Filipe Zeferino Tomás
Members: Doctor Cláudia Martins Antunes
Doctor Alexandre Valério de Mendonça
November 2012
Acknowledgments
First, I would like to thank my family for their support.
Secondly, I would like to thank my advisors, Sara Madeira and Pedro Tomás, for playing a
fundamental and difficult role in this work by guiding and motivating me. I would also like to thank
all the members of the NEUROCLINOMICS team for the hours they spent listening to my
presentations and providing valuable feedback.
My thanks to all my colleagues from room 128/425, and to André Silva for helping me with the
English.
This work was partially supported by FCT - Fundação para a Ciência e a Tecnologia under
project PTDC/EIA-EIA/111239/2009 (NEUROCLINOMICS - Understanding NEUROdegenerative
diseases through CLINical and OMICS data integration).
Finally, a special thanks to the author of the template used in this thesis, Pedro Tomás.
Abstract
Alzheimer’s disease (AD) is a well known neurodegenerative disease causing cognitive impairment.
Despite being one of the best studied diseases of the central nervous system, it remains
incurable. Mild Cognitive Impairment (MCI) is currently considered to be an early stage of a
neurodegenerative disease. Patients diagnosed with MCI are assumed to have a higher risk of
evolving to AD. In this context, the correct diagnosis of MCI and an effective assessment of its
predictive value for the conversion to AD are crucial.
In this thesis, neuropsychological data is used to distinguish patients with MCI from those
already suffering from AD and to predict the evolution of MCI patients to AD in time windows of 2,
3 and 4 years. We analyse a dataset with patients labelled by clinicians as MCI or AD. As is
the case for most real clinical data, this dataset is strongly imbalanced and has a high percentage
of missing values.
We use state-of-the-art supervised learning techniques and perform a thorough study of the
effect of class imbalance and missing values on their performance. Since the number of attributes
is large, feature selection is studied and used to effectively decrease the dimensionality of the
problem. A data mining methodology was created to automate the oversampling and
parameter search. The resulting tool automatically creates the models and parametrizes them,
taking into account how balanced the data is. The obtained results indicate that the developed
models achieve an accuracy of 91% in patient diagnosis and up to 82% in patient prognosis.
To help healthcare professionals, a decision support system was created that includes the
diagnosis and prognosis models developed in this work.
Keywords
Alzheimer’s Disease, Data Mining, Temporal Windows, Diagnosis, Prognosis, Prediction.
Resumo
A doença de Alzheimer (DA) é uma conhecida doença neurodegenerativa que causa uma
deficiência cognitiva. Apesar de ser uma das doenças mais estudadas do sistema nervoso central,
continua sem cura. A deficiência cognitiva ligeira (DCL) é considerada um estado inicial de
uma doença neurodegenerativa. Assume-se que pacientes diagnosticados com DCL têm um
maior risco de evolução para DA. Neste contexto, um correto diagnóstico e uma análise eficaz da
probabilidade de conversão são cruciais.
Nesta tese, dados neuropsicológicos são utilizados para distinguir os pacientes com DCL
daqueles com DA e para prever a conversão dos doentes com DCL para DA, em janelas temporais
de 2, 3 e 4 anos. O conjunto de dados analisado foi classificado por médicos como DCL e DA.
Como na maioria dos conjuntos de dados clínicos reais, estes são extremamente desbalanceados e
com um elevado número de valores omissos.
Foram utilizadas técnicas de aprendizagem supervisionada do estado da arte e foi também
realizado um estudo sobre o efeito, na classificação, do desbalanceamento de classes e dos
valores omissos. Como o número de atributos é grande, técnicas de seleção de atributos foram
utilizadas para diminuir a dimensionalidade do problema. Uma metodologia foi criada para
automatizar o processo de sobreamostragem e de procura de parâmetros. A ferramenta criada
cria automaticamente os modelos de classificação e parametriza-os, tendo em conta o estado de
balanceamento dos dados. Os resultados obtidos indicam que os modelos desenvolvidos obtêm
uma precisão, no diagnóstico de pacientes, na ordem dos 91% e, no prognóstico, na ordem dos
82%.
Para auxiliar os profissionais de saúde, um sistema de apoio à decisão foi criado e colocado à
disposição destes, com os modelos de diagnóstico e prognóstico criados no decorrer desta tese.
Palavras Chave
Doença de Alzheimer, Mineração de Dados, Janelas Temporais, Diagnóstico, Prognóstico,
Predição.
Contents
1 Introduction 1
1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 5
2.1 Alzheimer’s Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Neuropsychological tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Correlated based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Minimum redundancy maximum relevance (mRMR) . . . . . . . . . . . . . 18
2.4 Overcoming class imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Techniques to deal with imbalanced datasets . . . . . . . . . . . . . . . . . 20
2.4.1.A Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1.B Synthetic Minority Over-sampling Technique (SMOTE) . . . . . . 20
2.4.1.C Cost sensitive classifiers . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Differentiating MCI from AD (Diagnosis) 29
3.1 Formulation and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.2 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Predicting conversion from MCI to AD (Prognosis) 45
4.1 Prognosis prediction approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 First and Last Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.2 Temporal Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5 Decision Support System 61
5.1 System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6 Conclusions and Future Work 67
A Appendix Medical exams (in Portuguese) 75
B Appendix Diagnosis 81
C Appendix Prognosis 83
C.1 First and Last Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
C.2 Temporal window: Two years . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
C.3 Temporal window: Three years . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
C.4 Temporal window: Four years . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
List of Figures
2.1 Decision tree using 2 attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Neurode representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Multilayer feed-forward neural network. . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 SVM linear separable data representation . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Overfitting (Error versus Model Complexity) . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Histogram of missing values per feature in the original dataset. . . . . . . . . . . . 31
3.2 Classification model for the training data. . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Model that simulates real-world usage. . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Missing value replacement under the random assumption. . . . . . . . . . . . . 36
3.5 Missing value replacement under the non-random assumption. . . . . . . . . . 37
3.6 Train results of Diagnosis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7 Results for the diagnosis using an independent test set . . . . . . . . . . . . . . . 42
4.1 Representation of the prognosis class labels . . . . . . . . . . . . . . . . . . . . . 46
4.2 Variation of the prognosis class labels . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Classification model for the training data. . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Model that simulates real-world usage. . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Test results of Prognosis using First And Last Evaluations . . . . . . . . . . . . . . 50
4.6 Test results of Prognosis using two years temporal window . . . . . . . . . . . . . 52
4.7 Test results of Prognosis using three years temporal window . . . . . . . . . . . . 55
4.8 Test results of Prognosis using four years temporal window . . . . . . . . . . . . . 58
5.1 DSS web service architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Prototype data input screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3 Prototype output screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 DSS system screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
C.1 Train results of Prognosis using First and Last Evaluations . . . . . . . . . . . . . . 83
C.2 Train results of Prognosis using two years temporal window . . . . . . . . . . . . . 85
C.3 Train results of Prognosis using three years temporal window . . . . . . . . . . . . 86
C.4 Train results of Prognosis using four years temporal window . . . . . . . . . . . . 87
List of Tables
2.1 Confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Synthesis of the related work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Original Dataset details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Dataset details after pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Grid search parameter intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Details of the train set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Details of the test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 Diagnosis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Best Classifiers for the Diagnosis problem. . . . . . . . . . . . . . . . . . . . . . . 42
3.8 Confusion matrix for the diagnosis problem. . . . . . . . . . . . . . . . . . . . . . . 43
4.1 FirstLastDemographic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 First and Last Evaluation’s Parameters. . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Best Classifiers for the Prognosis problem First and Last approach. . . . . . . . . . 49
4.4 Two Years Temporal Window Demographic . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Classification model parameters for the prognosis in a temporal window of 2 years. 51
4.6 Best Classifiers for the prognosis problem with a 2 years temporal window. . . . . 53
4.7 Three Years Temporal Window Demographic . . . . . . . . . . . . . . . . . . . . . 53
4.8 Classification model parameters for the prognosis in a temporal window of 3 years . 54
4.9 Best Classifiers for the prognosis problem with a 3 years temporal window. . . . . 55
4.10 Four Years Temporal Window Demographic . . . . . . . . . . . . . . . . . . . . . . 56
4.11 Classification model parameters for the prognosis in a temporal window of 4 years. 56
4.12 Best Classifiers for the prognosis problem with a 4 years temporal window. . . . . 57
4.13 Best Models to each progression approach . . . . . . . . . . . . . . . . . . . . . . 59
A.1 Feature List part-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A.2 Feature List part-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.3 Feature List part-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
A.4 Feature List part-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.5 Feature List part-5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
B.1 Selected Features for diagnosis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
C.1 Selected Features for Prognosis using first and last evaluations. . . . . . . . . . . 84
C.2 Selected Features for Prognosis using the two years temporal window. . . . . . . . 85
C.3 Selected Features for Prognosis using the three years temporal window. . . . . . . 86
C.4 Selected Features for Prognosis using the four years temporal window. . . . . . . . 87
1 Introduction
Declines in cognitive and motor functions, together with other evidence of neurological
degeneration, become increasingly likely as healthy people age. The fact is that everyone will
experience altered brain function, although some at an earlier age or at a faster rate than others.
As such, distinguishing the motor and cognitive declines of normal ageing from those due to
pathological processes, and understanding individualized disease diagnostic and prognostic
patterns, are ongoing research challenges [42]. In this context, Alzheimer’s disease (AD), a well
known neurodegenerative disease causing cognitive impairment, is amongst the best studied
diseases of the central nervous system due to its devastating effect on patients and their families,
and to its socio-economic impact on modern societies. Nevertheless, it remains incurable.
Every year millions of new Alzheimer’s Disease cases are diagnosed. The result is dementia
in elderly individuals; institutionalization and expensive medical care are a common outcome. An early
diagnosis and prognosis can improve the patient’s quality of life, minimizing the need for institutionalization
and expensive medical care, reducing the patient’s and the family’s suffering and minimizing the
socio-economic effects on society. In this context, finding out if and when a patient will
progress from Mild Cognitive Impairment (MCI) to Alzheimer’s Disease is of major importance
for the timely administration of pharmaceutical and therapeutic interventions. Furthermore, it allows
medical doctors to adjust the periodicity of medical consultations.
Mild Cognitive Impairment is currently considered to be an early stage of a neurodegenerative
disease, particularly AD. Patients diagnosed with MCI are regarded with special attention since
they are assumed to have a higher risk of evolving to dementia, usually AD [51]. Under these
assumptions, the correct diagnosis of MCI conditions and an effective assessment of its predictive
value for the conversion to AD are thus of major importance. However, the definition of MCI and
its diagnosis criteria are not yet consensual; the pathological and molecular substrate of people
diagnosed with MCI is not well established [40]. Moreover, people considered to be suffering
from preMCI, that is, people having cognitive complaints but not fulfilling the criteria for MCI, have
recently been shown to have a high risk of progression to MCI and AD [33]. This makes the
diagnosis of MCI a difficult task in itself and consequently makes the prediction of MCI-to-AD
conversion an even more complicated task.
Neuropsychological tests have been used by medical doctors mainly because they are cheaper
and faster than PET scans and biomarker searches. Furthermore, technologies such as PET scans
and biomarkers are not globally available. The neuropsychological tests involve simple tasks,
such as those concerning orientation, memory, attention and language, to evaluate the mental
state of the patient. This work aims to use such data to predict the conversion of Mild Cognitive
Impairment (MCI) to Alzheimer’s Disease (AD). The use of data mining algorithms allows the
extraction of knowledge, or rules, from the data regarding the prediction of MCI-to-AD conversion.
1.1 Problem Formulation
To study the relation between MCI and AD, researchers typically focus on three related but
distinct problems [7, 18, 25, 27, 36, 50]: (1) distinguishing MCI from AD; (2) predicting the conversion
from MCI to AD; and (3) predicting the time to conversion from MCI to AD. In this work, we tackle
the problem of distinguishing patients with MCI from those already suffering from AD, using
neuropsychological data. This type of data has also been used by other authors [7, 18, 25, 36]. Given
the increasing difficulty of these three problems and the non-consensual classification of patients
as MCI, distinguishing MCI from AD is an important problem in itself, but it gains increasing
relevance as support for the feasibility of effectively tackling the conversion and time-to-progression
problems.
This work tackles two related problems: (i) distinguishing MCI from AD patients; and (ii)
predicting whether an MCI patient will evolve to AD. To achieve this goal, we use state-of-the-art
machine learning techniques, such as SVMs, artificial neural networks, Naïve Bayes, C4.5 decision trees
and k-nearest neighbours. Furthermore, we investigate the application of techniques that reduce
the influence of missing values in the learning process and that reduce data complexity, therefore
increasing model generalization.
1.2 Contributions
The outcome of this work is the following:
1. Development of a framework for model optimization that includes techniques for: (i) dimen-
sionality reduction, through feature selection; (ii) reducing the effect of class imbalance,
through synthetic minority oversampling; and (iii) dealing with missing values.
2. Application of the model optimization framework to the diagnosis problem (overall accuracy
of up to 91%) and to the prognosis problem (overall accuracy of up to 82% for a 3-year
temporal window).
3. Proposal of a temporal-window methodology for the prognosis of MCI patients, with results
showing that it outperforms the traditional approach;
4. Development of a decision support system, based on a web services infrastructure, for the
deployment of the developed models in healthcare professionals’ offices.
Part of the results obtained in the course of this thesis have already been communicated in:
[13] Lemos et al., ”Discriminating Alzheimer’s disease from mild cognitive impairment using
neuropsychological data”, ACM SIGKDD Workshop on Health Informatics (HI-KDD2012), Beijing,
China, August 2012;
[14] Lemos et al., ”Predicting conversion from mild cognitive impairment to Alzheimer’s
disease using neuropsychological data: Preliminary results”, 26th Meeting of the Grupo de Estudo de
Envelhecimento Cerebral e Demências, Tomar, June 2012;
Furthermore, a new paper is under preparation to communicate the results concerning
prognostic prediction of Alzheimer’s disease, namely the comparison of the temporal-windows
approach against the traditional approach.
1.3 Dissertation outline
The work described in this thesis is organized as follows. Chapter 2 describes the background
on Alzheimer’s disease and introduces the basic concepts on data mining algorithms and tech-
niques. Chapter 3 describes the work related to the diagnosis of patients, i.e., regarding the differ-
entiation between MCI and AD patients. For this, a framework for obtaining the model parameters
is described, a comparison between several missing value imputation techniques is performed
and the results for patient diagnosis are presented. Chapter 4 applies a similar strategy to
perform prognostic prediction of MCI patients. This considers two approaches: (i) predicting whether a
patient will ever convert to AD and (ii) predicting whether a patient will convert to AD after a fixed
period of time. Results presented in this chapter suggest that the latter approach leads to more
accurate models. All these models are then integrated into a decision support system, based on
a web services infrastructure, whose architecture is briefly described in Chapter 5. Finally, the
conclusions are drawn in Chapter 6, where some possibilities regarding future work are also presented.
2 Background
2.1 Alzheimer’s Disease
By the latest estimates, 25 million people currently suffer from dementia and, as a consequence
of population ageing, the number of persons affected by this condition is expected to climb,
doubling every 20 years. Complaints of a cognitive nature are very common in aged individuals and can
be the first sign of ongoing neurological disorders such as AD [37], which is the most common
irreversible, progressive cause of dementia. AD can be described as a gradual loss of memory
and cognitive skills. Every year over 5 million new cases are reported and the incidence
increases with the age of the individual: 1% at age 60, 6% at age 70 and 8% at age 85 or
older. These numbers are likely to rise as a consequence of the expected increase in life
expectancy [2]. The relation between age and AD incidence is evident, making age the most
influential risk factor in the diagnosis of AD. In fact, AD incidence rises from 2.8 per thousand
person-years for individuals between 65 and 69 years old to 56.1 per thousand person-years for
individuals older than 90. Nearly 10% of people older than 70 have significant
memory loss, and probably more than half of those have AD. For people over 85 years old, it is estimated
that 25% to 45% have dementia [2]. Among the people who have cognitive
complaints, it is possible to identify those at risk of progressing to dementia: those suffering from MCI.
Since the MCI classification mandates a cognitive decline greater than expected
for the person’s age and education level, neuropsychological testing is a fundamental element
of the diagnosis [37]. Currently, many efforts are being carried out to investigate AD pathology
and develop appropriate treatment strategies. These strategies have centered their attention on the
long-term conservation of cognitive and functional abilities or on slowing down the disease
development, along with reducing behavioural symptoms and maintaining the patient’s quality of life.
Nowadays, there is no treatment leading to a cure or a complete halt of AD progression.
Nonetheless, a current medical objective is the reduction of symptoms, which can delay the
institutionalization of the patient, thereby reducing caregiver costs [49]. It is possible to detect
traces, or biomarkers, of AD in patients with MCI through magnetic resonance imaging
volumetric studies, neurochemical analysis of the cerebrospinal fluid and Positron Emission
Tomography. These techniques have higher accuracy and sensitivity than neuropsychological tests
in predicting the progression of MCI patients to dementia; however, these methods are expensive,
technically challenging, in some cases invasive, and not widely accessible [37].
2.1.1 Neuropsychological tests
To diagnose, stage, assess and monitor AD, MCI and other dementias, the mental
health of a patient is assessed through neuropsychological assessment using a common set of
tests. These tests aim to identify and quantify cognitive, functional and behavioural symptoms.
A number of test batteries have been developed by medical doctors to assess the mental health of
patients. The most important batteries are: the Mini-Mental State Examination (MMSE), the Alzheimer’s
Disease Assessment Scale (ADAS) and, in our case, the Bateria de Lisboa para Avaliação de
Demência (Lisbon Battery for Dementia Evaluation) (BLAD). Each battery is composed
of multiple tests, and some batteries are composed of multiple other batteries.
The MMSE is one of the most widely used test batteries for a brief evaluation of cognitive
status in adults [38, 20, 52]. The ADAS was designed to measure the severity of the most important
symptoms of AD. The Alzheimer’s Disease Assessment Scale Cognitive subscale (ADAS-cog) is
the most popular cognitive testing instrument used in clinical trials of nootropics (drugs, functional
foods, supplements, etc., that improve mental functions). It consists of 11 tasks measuring the
disturbances of memory, language, praxis, attention and other cognitive abilities, which are often
referred to as the core symptoms of AD [31].
BLAD [6, 52] is a comprehensive neuropsychological battery evaluating multiple cognitive
domains and has been validated for the Portuguese population. This battery includes tests for
the following cognitive domains: attention (Cancellation Task); verbal, motor and graphomotor
initiatives (Verbal Semantic Fluency, Motor Initiative and Graphomotor Initiative); verbal compre-
hension (a modified version of the Token Test); verbal and non-verbal abstraction (Interpreta-
tion of Proverbs and the Raven Progressive Matrices); visual-constructional abilities (Cube Copy)
and executive functions (Clock Draw); calculation (Basic Written Calculation); immediate memory
(Digit Span forward); working memory (Digit Span backward); learning and verbal memory (Verbal
Paired-associate Learning, Logical Memory and Word Recall).
The data used in this work was obtained using BLAD. Each evaluation of a patient corresponds
to an instance identified by the evaluation date and the patient’s ID. The majority of
patients have multiple evaluations. Each evaluation is associated with a date and a
patient classification, which is one of the following: Normal, Pre-MCI, MCI-frontal, slight MCI,
MCI, amnesic MCI, MCI md, advanced MCI, MCI-a, vascular MCI, slight Dementia, Dementia and
without class. This data was initially pre-processed to contain only 4 classes: Normal, Pre-MCI,
MCI and Dementia. The new MCI class is composed of all MCI subtypes. The instances
without classification were removed, and the Normal and Pre-MCI instances were also discarded.
Many instances contain missing values for a set of neuropsychological tests.
2.2 Classification
To extract useful knowledge from this data, data mining techniques have to be used. Generally
these are divided into 7 steps [23]:
1. Data Cleaning, to remove inconsistent data and outliers. In our problem this is
of great importance and has already been addressed by reporting errors to the medical
doctors.
2. Data Integration, to combine data from different sources. In this work there was no
need to perform it, since the data was already delivered as a single table. This may, however,
be important in the future if data from another database is integrated (e.g., the ADNI database)1.
3. Data Selection (sometimes referred to as feature selection), to discover relevant features. This
step is of enormous importance to simplify the data and minimize the confusion presented
to the classifier.
4. Data Transformation, to transform the data so that it fits the classification process. For
now, this step is performed automatically by the WEKA software [22]. In the future, it will be
revisited if newly studied algorithms require it.
5. Data Mining, the process where intelligent methods are applied in order to extract data
patterns. For the diagnosis and prognosis problems, a comparative study of various
algorithms was performed.
6. Pattern Evaluation, with the purpose of recognizing interesting patterns that represent the
knowledge. This step corresponds to the evaluation of the different classifiers and the different
parameters used.
1 http://adni.loni.ucla.edu/, last accessed 13 October 2012
7. Knowledge Presentation, where visualization and knowledge representation techniques are
used to present the acquired knowledge to the user. In our case, this will be performed when
we show our results and rules to the medical doctors.
Classification can be described as a two-stage process [23]. In the first stage, a classifier
describing a set of data classes is built. This stage is designated the learning step or training
phase. In this stage the classification algorithm is ”learning from” a training set composed of
data instances, each made up of an n-dimensional attribute vector, X = (x1, ..., xn), and a
class label. In this case, X is a set of attributes extracted from the neuropsychological data, and the
class label is the patient’s mental health given by a medical evaluation and categorized as MCI or
AD. The attributes in vector X can be numerical or categorical. The instances used to train the
classification algorithm compose the training set. This type of process is known as supervised
learning, since the class label attribute is provided for each X, in contrast to unsupervised
learning algorithms, which do not know the class label attribute or the number of classes to be
learned in advance. In the context of classification, the n-dimensional attribute vector representing
an evaluation of the patient, together with the respective class label, is called an instance. In
the second stage, the model obtained is used to classify the test set. The test set is a subset of the data,
independent from the training set, that is used to measure the accuracy of the classification model.
It should be noted that in this work we only use supervised methods. However, as future work,
unsupervised learning could be used: for example, to decrease the complexity presented to the classifier,
clustering techniques could be applied to divide the MCI group into subgroups.
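The two-stage process above relies on a test set that is independent from the training set. A minimal holdout split can be sketched as follows; this is only an illustrative stdlib-Python sketch, and the function name and the 70/30 split are assumptions, not the experimental protocol used in this thesis.

```python
import random

def train_test_split(instances, test_fraction=0.3, seed=42):
    """Split labelled instances into independent train and test sets.

    Each instance is a (attribute_vector, class_label) pair; these names
    are illustrative, not taken from the thesis.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = instances[:]        # copy: do not mutate the caller's list
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labelled data in the (attributes, label) shape described above
data = [([i, i % 5], "MCI" if i < 12 else "AD") for i in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))  # 14 6
```

The model is fit only on `train`; accuracy reported on `test` then estimates performance on unseen patients.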
k-Nearest-Neighbour Classifiers
The k -Nearest-neighbour (kNN) classification consists on learning by comparison. Suppose
we define a metric to evaluate the distance between two instances, for example, or the Euclidean
distance:
dist(X1, X2) =
√√√√ n∑i=1
(x1i − x2i)2 (2.1)
or the the Manhattan distance:
dist(X1, X2) =
n∑i=0
|x1i − x2i| (2.2)
where X_1 = (x_{11}, x_{12}, ..., x_{1n}) and X_2 = (x_{21}, x_{22}, ..., x_{2n}). The kNN algorithm works as follows. For each instance X_i in the test set, find the K nearest instances in the training set (X_{i1}, X_{i2}, ..., X_{iK}). Then, classify the instance X_i as belonging to the most common class among the K nearest neighbours. Typically, the values of each attribute are normalized before using the Euclidean distance. This prevents the under-weighting of attributes with a smaller range relative to attributes with a larger range. To deal with missing values, the classifier assumes the highest possible difference between the two attribute values, which can cause classification mistakes. However, adequate pre-processing can overcome this limitation of the algorithm, for example by replacing missing values with the mean value of the attribute [12].
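A minimal sketch of this procedure (distance metric, K nearest neighbours, majority vote), assuming small in-memory lists; the training instances below are made up for illustration:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    # Equation (2.1): square root of the sum of squared attribute differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_predict(train, test_instance, k=3):
    # train: list of (attribute_vector, class_label) pairs
    neighbours = sorted(train, key=lambda inst: euclidean(inst[0], test_instance))[:k]
    # majority vote among the k nearest neighbours
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1.0, 1.0), "MCI"), ((1.2, 0.9), "MCI"),
         ((5.0, 5.0), "AD"), ((4.8, 5.2), "AD")]
print(knn_predict(train, (1.1, 1.0), k=3))  # → MCI
```

In practice the attribute values would be normalized first, as noted above, so that attributes with large ranges do not dominate the distance.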
Naïve Bayes
Bayesian classifiers are statistical methods that can forecast class membership probabilities using the Bayes theorem in (2.3):

P(H|X) = \frac{P(X|H) P(H)}{P(X)}    (2.3)

where H represents some hypothesis, such as belonging to a class C_i, and X is the instance. Studies comparing classification algorithms have found that a simple Bayesian classifier, such as Naïve Bayes, can in some cases be comparable in performance to decision tree and neural network classifiers [23]. Bayesian classifiers have also demonstrated high accuracy and speed when applied to large amounts of data. Naïve Bayes classifiers assume that attributes are independent and work as follows [23]:
1. Let D be a training set (n-dimensional attribute vectors and respective class labels).
2. Suppose that there are m classes, C_1, C_2, ..., C_m. Given a test instance, X, the classifier predicts the class X belongs to by choosing the class with the highest posterior probability, P(C_i|X):

\forall_{j \neq i} \; P(C_i|X) > P(C_j|X)    (2.4)

The class C_i maximizing P(C_i|X) is called the maximum a posteriori hypothesis.
3. Since P(X) is constant for all classes, we only need to maximize P(X|H)P(H). Replacing the hypothesis H by the class C_i, we have P(X|C_i)P(C_i). If the class prior probabilities are unknown, we assume that all classes are equally likely and maximize P(X|C_i).
4. Since Naïve Bayes classifiers assume that attributes are conditionally independent, it follows that

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \cdot P(x_2|C_i) \cdots P(x_n|C_i)    (2.5)

5. Estimation of P(x_k|C_i) is performed differently for categorical and continuous-valued attributes. For categorical attributes, P(x_k|C_i) is the number of instances of class C_i in the training set having the value x_k for attribute k, divided by the number of instances of class C_i in the training set. For continuous-valued attributes, the probability density function must be estimated. A simple approach is to assume that P(x_k|C_i) is normally distributed, in which case:

P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})    (2.6)

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (2.7)

where \mu is the expected value of x_k and \sigma is the standard deviation.
6. To predict the class label of the test instance, X, we use (2.5) for each class and then, by
applying (2.4) we obtain the most probable class.
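Steps 1 to 6 can be sketched for continuous-valued attributes under the Gaussian assumption of (2.6)-(2.7); the training data below is made up for illustration:

```python
import math
from collections import defaultdict

def gaussian(x, mu, sigma):
    # Equation (2.7): normal probability density
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit(train):
    # Step 5: estimate per-class prior, mean and standard deviation of each attribute
    by_class = defaultdict(list)
    for x, label in train:
        by_class[label].append(x)
    params = {}
    for label, rows in by_class.items():
        stats = []
        for k in range(len(rows[0])):
            vals = [r[k] for r in rows]
            mu = sum(vals) / len(vals)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
            stats.append((mu, sigma))
        params[label] = (len(rows) / len(train), stats)  # (prior, attribute stats)
    return params

def predict(params, x):
    # Steps 2-4 and 6: choose the class maximizing P(Ci) * prod P(xk|Ci)
    best, best_p = None, -1.0
    for label, (prior, stats) in params.items():
        p = prior
        for xk, (mu, sigma) in zip(x, stats):
            p *= gaussian(xk, mu, sigma)
        if p > best_p:
            best, best_p = label, p
    return best

train = [((1.0, 2.0), "MCI"), ((1.2, 2.2), "MCI"),
         ((5.0, 6.0), "AD"), ((5.4, 6.4), "AD")]
model = fit(train)
print(predict(model, (1.1, 2.1)))  # → MCI
```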
Decision Trees
A decision tree is a model structure where each non-leaf node has a test on an attribute, each
branch represents an outcome of the test and each leaf has a class label (see Figure 2.1). The
top node is the root node. In Figure 2.1 the root node is the node that tests Attribute 1.
Figure 2.1: A decision tree that uses 2 attributes: Attribute 1, which is ternary (Low, Normal and High), and Attribute 2, which is binary (True or False). Each internal (non-leaf) node (in blue) represents a test on an attribute. Each leaf node (in orange) represents a class (Class 1 or Class 2).
When this classifier receives an instance with an unknown class label, the attributes of the instance are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class label for that instance. In the case of Figure 2.1, an instance X = {Attribute 1 = Normal, Attribute 2 = False} would first be tested at the Attribute 1 node (the root node). Since Attribute 1 = Normal, it would then be tested at the Attribute 2 node, and since Attribute 2 = False the tree would predict that the instance belongs to Class 1. If Attribute 1 were Low or High, only that first test would be necessary, and Attribute 2 would not be used.
A basic algorithm for the construction of a decision tree receives as input a set of instances, an attribute list and an attribute selection method. The set of instances is initially the training set but, since the algorithm is recursive, this set changes during execution. The attribute list is the list of attributes that describe the instances. Finally, the attribute selection method specifies a heuristic procedure for selecting the attribute that best discriminates the instances according to the class. This procedure uses an attribute selection measure, such as information gain [48], the Gini index [48] or the gain ratio [3]. The tree begins as a single node, N, representing the training set. If all the instances have the same class label, the node becomes a leaf, is annotated with that class, and the algorithm ends. Otherwise, the attribute selection method is called to determine the splitting criterion, i.e., the best way to partition the instances into individual classes. The splitting criterion indicates which branches should be grown from N with respect to the outcomes of the chosen test. The node N is then labelled with the splitting criterion, which becomes the test at the node. A branch is grown from N for each outcome of the splitting criterion, and the instances are divided accordingly. The splitting attribute falls into one of two scenarios: discrete-valued or continuous-valued. If the splitting attribute is discrete-valued, the outcomes of the test at N correspond directly to the known values of the attribute, and a branch is created for each value. In this case the attribute is removed from the attribute list, since it will not be considered in any further split. If the splitting attribute is continuous-valued, N has two outcomes, Attribute ≤ split point and Attribute > split point. The split point is returned by the attribute selection method, and two branches are grown from N with these outcomes as labels. The algorithm is recursive, repeating the process for each subset of the training set created. The possible stop conditions of the algorithm are:
• All the instances in the training set belong to the same class.
• There are no more attributes to split. In this case N is converted to a leaf and labelled with
the most common class in the current training set.
• There are no instances for a given branch. In this case, a leaf is created with the majority
class in the current training set.
Finally, the decision tree is returned.
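The recursive construction just described can be sketched as follows, assuming categorical attributes only and entropy reduction (the information gain measure discussed below) as the attribute selection method; the tree is represented as nested dictionaries, with class labels at the leaves:

```python
import math
from collections import Counter

def entropy(labels):
    # -sum(pi * log2 pi) over the class proportions in a label list
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(instances, attributes):
    # instances: list of (dict attribute -> value, class label) pairs
    labels = [lab for _, lab in instances]
    if len(set(labels)) == 1:          # stop: all instances share one class
        return labels[0]
    if not attributes:                 # stop: no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    # select the attribute whose partition needs the least remaining information
    def remainder(attr):
        counts = Counter(x[attr] for x, _ in instances)
        return sum((n / len(instances)) *
                   entropy([lab for x, lab in instances if x[attr] == v])
                   for v, n in counts.items())
    best = min(attributes, key=remainder)
    node = {}
    for value in {x[best] for x, _ in instances}:
        subset = [(x, lab) for x, lab in instances if x[best] == value]
        # the chosen attribute is removed from the list for the recursive call
        node[(best, value)] = build_tree(subset, [a for a in attributes if a != best])
    return node

# a tiny dataset mirroring the shape of Figure 2.1
data = [({"a1": "Low", "a2": True}, "C1"),
        ({"a1": "Normal", "a2": True}, "C2"),
        ({"a1": "Normal", "a2": False}, "C1"),
        ({"a1": "High", "a2": False}, "C2")]
tree = build_tree(data, ["a1", "a2"])
print(tree[("a1", "Low")])  # → C1
```

This sketch omits continuous-valued splits and pruning, both discussed in the text.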
If the training set has noise or outliers, the tree will grow branches that reflect these problems; tree pruning tries to identify and remove such branches. The most commonly used attribute selection measures are the following:
• Information Gain [23]
Information gain is based on the information content of messages. The attribute chosen to split is the one with the highest information gain. This attribute minimizes the information needed to classify the instances in the resulting partitions and maximizes the homogeneity of the class labels in those partitions. This approach produces simple trees and reduces the number of tests. Let D be the instance set. The information of D, called Gain(D), is defined as follows:

Gain(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)    (2.8)
where p_i is the probability that a random instance in D belongs to class C_i, estimated by |C_{i,D}|/|D|. A base-2 logarithm is used since the information is encoded in bits. Gain(D) is also known as the entropy of D.
If the attribute is discrete-valued with v distinct values, v branches will be grown. Ideally, we want each partition to contain only instances from the same class, but that is rarely achieved. Thus, we need to know how much more information is still needed in order to obtain an exact classification. We use the following expression for this purpose. Let D_j be the set of instances in partition j:

Gain_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \cdot info(D_j)    (2.9)

The term |D_j|/|D| represents the weight of partition j. Gain_A(D) is the expected information required to classify an instance from D based on this partitioning. Smaller values mean more homogeneous partitions with respect to the class label.
The information gain is given by the difference between the original information, based only on the class proportions, and the information obtained after the partitioning:

Gain(A) = Gain(D) - Gain_A(D)    (2.10)

For continuous-valued attributes, we have to determine the best split point, a threshold on the attribute, after sorting the attribute values in increasing order. In general, the midpoint between each pair of consecutive values is considered. The information gain attribute selection measure is used in the ID3 algorithm [47].
• Gain ratio [23]
C4.5 uses an extension of information gain, called the gain ratio, which is computed as follows:

GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}    (2.11)

where SplitInfo_A represents the potential information generated by splitting the training set:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)    (2.12)

In this setting, the attribute with the highest gain ratio is chosen. C4.5 thus applies a kind of normalization to the information gain, using the split information value.
• Gini index [23]
The Gini index measures the impurity of a set of instances using the following expression:

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2    (2.13)

where p_i represents the probability that an instance of D belongs to class C_i, estimated as |C_{i,D}|/|D|. CART uses the Gini index and considers a binary split for each attribute. In order to find the attribute with the best binary split, in the discrete-valued case we have to analyse all the possible subsets that can be formed using the known values of the attribute. Each such subset can be considered as a binary test for the attribute.
In the case of two partitions, D_1 and D_2, Gini_A is given by:

Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)    (2.14)

For continuous-valued attributes, each possible binary split is analysed. The procedure is similar to that of information gain, where the midpoint of each sorted pair of values is taken as a possible split; the point giving the minimum Gini index value is chosen as the split point. The reduction in impurity obtained by performing a binary split on attribute A is given by:

\Delta Gini(A) = Gini(D) - Gini_A(D)    (2.15)
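The three selection measures above can be illustrated on a small label set (a minimal sketch, independent of any particular tree implementation):

```python
import math
from collections import Counter

def info(labels):
    # Gain(D), equation (2.8): entropy of a label list
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_after_split(partitions):
    # Gain_A(D), equation (2.9): weighted information of the partitions
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * info(p) for p in partitions)

def split_info(partitions):
    # SplitInfo_A(D), equation (2.12): information of the partition sizes
    n = sum(len(p) for p in partitions)
    return -sum(len(p) / n * math.log2(len(p) / n) for p in partitions)

def gini(labels):
    # Gini(D), equation (2.13): 1 minus the sum of squared class probabilities
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

mixed = ["MCI", "MCI", "AD", "AD"]
parts = [["MCI", "MCI"], ["AD", "AD"]]         # a pure binary split
gain = info(mixed) - info_after_split(parts)   # equation (2.10)
print(gain)                                    # 1.0: the split removes all uncertainty
print(gain / split_info(parts))                # gain ratio, equation (2.11)
print(gini(mixed))                             # 0.5 for a perfectly mixed binary set
```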
Neural Networks
As in [23], a neural network is a set of connected input/output units, or neurodes, where each connection has an associated weight, as shown in Figure 2.2. Neural networks are computational analogues of biological neurons. Given a neurode j in a hidden or output layer, the net input I_j to the neurode is:

I_j = \sum_{i} w_{ij} O_i + \theta_j    (2.16)

where w_{ij} is the weight of the connection from neurode i to neurode j; O_i is the output of neurode i in the previous layer; and \theta_j is the bias of the neurode, which allows its activity to be shifted. Each neurode in the hidden or output layers applies an activation function to its net input. This function is a non-linear, differentiable function (typically logistic) that allows the classification of problems that are not linearly separable [23].
A neural network is simply a set of neurodes organized in layers: an input layer, one or more hidden layers, and an output layer. The neurodes in the input layer are called input neurodes. The inputs to the neural network correspond to the attributes measured for each training instance. The input instance is fed simultaneously into the neurodes of the input layer. These inputs pass through the input layer and are then weighted and fed as input to the second layer, the first hidden layer. The outputs of one hidden layer can be the input of the next hidden layer, and so on. The weighted outputs of the last hidden layer are the inputs of the output layer neurodes, which emit the network's prediction for a given instance.
A network is called feed-forward if none of the weights cycles back to a hidden neurode or to an output neurode of a previous layer. The network is fully connected if each neurode provides
Figure 2.2: A neurode of a hidden or output layer. The inputs to the neurode are the outputs of the previous layer. These are multiplied by their respective weights to form a weighted sum, to which the neurode bias is added. A non-linear activation function is then applied to produce the neurode's output. If the hidden layer is the first one, its inputs correspond to the input instance.
an input to all neurodes in the next layer. Such a network can model the class prediction as a non-linear combination of the inputs, that is, as a non-linear regression. Given enough hidden neurodes and enough training instances, a neural network can closely approximate any function [23].
The network topology is defined by the user, that is, the number of hidden layers, the number of neurodes in each hidden layer, and the number of input and output neurodes. The choice of these values is usually a trial-and-error process and may affect the accuracy of the final model. Neural networks can be used for classification (predicting the instance label) or for prediction of a continuous-valued output. For classification, one output neurode may be used to represent two classes (where 1 represents one class and 0 the other). If the problem has more than two classes, one output neurode is used per class [23].
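The forward pass through such a network (equation (2.16) followed by an activation function at each neurode) can be sketched as follows; the topology and weight values below are illustrative, not a trained model:

```python
import math

def sigmoid(x):
    # logistic activation function
    return 1.0 / (1.0 + math.exp(-x))

def neurode_output(inputs, weights, bias):
    # equation (2.16): net input I_j = sum(w_ij * O_i) + theta_j, then activation
    net = sum(w * o for w, o in zip(weights, inputs)) + bias
    return sigmoid(net)

def forward(instance, layers):
    # layers: list of layers, each a list of (weights, bias) pairs, one per neurode
    outputs = instance
    for layer in layers:
        outputs = [neurode_output(outputs, w, b) for w, b in layer]
    return outputs

# a 2-2-1 fully connected feed-forward topology with made-up weights
layers = [
    [([0.5, -0.4], 0.1), ([0.3, 0.8], -0.2)],   # hidden layer: 2 neurodes
    [([1.0, -1.0], 0.0)],                       # output layer: 1 neurode
]
prediction = forward([0.2, 0.7], layers)
print(prediction)  # a single value in (0, 1), interpretable as a class score
```

Training, i.e. adjusting the weights and biases, is done by backpropagation, described next.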
Backpropagation is the most common neural network learning algorithm. Backpropagation learns by iteratively comparing the predicted output for each training instance with the real value, which may be a class label in the case of classification or a continuous value in the case of prediction. The weights (w_{ij}, \theta_j) of the network are then adjusted to minimize the mean square error (MSE) between the network's prediction and the actual target value of the instance (the MSE is the most common error metric, but others exist). These adjustments are made by computing the derivative of the error with respect to each weight. The learning process stops when the weights converge. The backpropagation algorithm can, in a very simple way, be divided into two phases: propagation and weight update. After receiving the input parameters, the response of a unit is propagated as input to the neurodes in the next layer until the
Figure 2.3: Multilayer feed-forward neural network. A multilayer feed-forward neural network is aset of layers: one input layer, one or more hidden layers and an output layer.
output layer, where the response of the network is obtained and the error is computed as:

Err_j = (O_j - T_j)^2    (2.17)

where O_j is the observed output of neurode j and T_j is the known target value for the given training instance. The error is then propagated backwards, from the output layer to the first hidden layer, and the synaptic weights are adjusted along the way.
A disadvantage of neural networks, besides the generally long training time, is their poor interpretability. It is difficult for humans to interpret the symbolic meaning behind the learned weights and the hidden units of the network. Advantages of neural networks include their high tolerance to noise in the data, their ability to find patterns they have not been trained on, their ease of use when little is known about the relation between attributes and classes, and their suitability for continuous-valued inputs and outputs, in contrast to decision trees [23].
Support vector machines
Support vector machines (SVMs) are a method for linear classification of data, as shown in Figure 2.4. Non-linear classification can nevertheless be achieved by applying a non-linear kernel to the data; this transforms the data into a higher-dimensional space where linear classification can be applied, which results in non-linear classification in the original space. Briefly, an SVM works as follows [23]: it uses a non-linear mapping φ() to transform the original training data into a higher dimension, Y = φ(X). In the new space it searches for the optimal linear hyperplane that separates the classes. In the SVM sense, the optimal hyperplane (W) is the one that maximizes the margin (distance) between the two classes (in the transformed space), as shown in Figure 2.4.
Once the optimal hyperplane is found, the classification between the two classes can be achieved by computing:

d(Y) = sign(W \cdot Y + b)    (2.18)

To find the optimal hyperplane, a linear combination of training points can be used:

W = \sum_{i} \alpha_i C_i Y_i    (2.19)

where C_i = \{1, -1\} indicates the true class of instance Y_i = φ(X_i), and \alpha_i is a coefficient indicating how difficult it is to classify instance X_i. Using (2.19), one can rewrite the decision function as:

d(Y) = sign\left(\sum_{i} \alpha_i C_i Y_i \cdot Y + b\right)    (2.20)
In the non-linear classification case, one can define a non-linear transformation kernel K(X_1, X_2) = φ(X_1) \cdot φ(X_2). Typical kernels are:
• linear: K(X_i, X_j) = X_i^T X_j.
• polynomial: K(X_i, X_j) = (\gamma X_i^T X_j + r)^d, \gamma > 0.
• radial basis function (RBF): K(X_i, X_j) = \exp(-\gamma ||X_i - X_j||^2), \gamma > 0.
• sigmoid: K(X_i, X_j) = \tanh(\gamma X_i^T X_j + r).
where \gamma, r and d are kernel parameters. In practice, since the transformation function φ() is not always known, the decision is performed directly in the original space through the kernel function K():

d(X_j) = sign\left(\sum_{i} \alpha_i C_i K(X_i, X_j) + b\right)    (2.21)

where b is a parameter computed through the support vectors (SVs), i.e., all instances X_i in the training set that have a corresponding value \alpha_i > 0:

b = \frac{1}{N_{SVs}} \sum_{i \in SV} \left( \alpha_i K(X_i, X_j) - C_i \right)    (2.22)

where N_{SVs} is the number of support vectors.
To obtain the optimal hyperplane, an optimization problem can be derived [23]:

maximize

L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j C_i C_j K(X_i, X_j)    (2.23)

subject to

\forall_i \; \alpha_i \geq 0    (2.24)
\sum_{i} \alpha_i C_i = 0    (2.25)

This is a quadratic optimization problem with linear constraints and can be solved by standard constrained optimization algorithms. For some transformation functions, it may not be possible to obtain a hyperplane that exactly separates all training instances. In such cases, slack variables (ξ) may be added to the optimization function. Two formulations typically arise:
• Box constraints

W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j C_i C_j K(X_i, X_j)    (2.26a)
0 \leq \alpha_i \leq C    (2.26b)
\sum_{i} \alpha_i C_i = 0    (2.26c)
• Diagonal

W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j C_i C_j K(X_i, X_j) - \frac{1}{2C} \sum_{i} \alpha_i^2    (2.27a)
0 \leq \alpha_i    (2.27b)
\sum_{i} \alpha_i C_i \geq 0    (2.27c)
Binary classification is the standard SVM technique; multi-class classification is more complicated. One approach to the multi-class classification problem is to combine several binary SVM classifiers [16]. Example strategies are: the one-versus-all method using a winner-takes-all strategy (WTA SVM) [16]; the one-versus-one method implemented by max-wins voting (MWV SVM) [16]; DAGSVM [44]; and error-correcting codes [15].
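The RBF kernel and the decision function in (2.21) can be sketched as follows; the support vectors and \alpha values below are hypothetical, not the result of actually solving (2.23)-(2.25):

```python
import math

def rbf_kernel(xi, xj, gamma=1.0):
    # radial basis function kernel: exp(-gamma * ||xi - xj||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def svm_decide(x, support_vectors, kernel, b=0.0):
    # equation (2.21): sign of the kernelized weighted sum over the support vectors
    s = sum(alpha * c * kernel(xi, x) for alpha, c, xi in support_vectors) + b
    return 1 if s >= 0 else -1

# two made-up support vectors, each given as (alpha, class in {1, -1}, vector)
svs = [(0.5, 1, (1.0, 1.0)), (0.5, -1, (3.0, 3.0))]
print(svm_decide((1.2, 0.9), svs, rbf_kernel))  # → 1 (near the positive SV)
print(svm_decide((3.1, 2.9), svs, rbf_kernel))  # → -1 (near the negative SV)
```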
2.3 Feature Selection
A central problem in machine learning is identifying the set of features that best represents the data and can be used to construct a classification model for a particular problem [21]. Feature selection is the process of selecting a subset of features for building robust learning models. It is particularly important when the dataset has a large number of features. By removing redundant and irrelevant features from the dataset, feature selection aims to improve the overall performance of the learning models. This improvement results from reducing the curse of dimensionality [21], increasing the generalization capability of the learning models (and decreasing overfitting), speeding up the learning process, and improving model interpretability, since fewer features are used.
Feature selection algorithms are generally divided into two groups: feature ranking and subset selection. Feature ranking methods rank the features using a metric and remove all features that do not achieve a sufficiently high score. Subset selection methods search the space of possible feature subsets for the optimal one.
The optimal solution for supervised learning is an exhaustive search in which all feature subsets are tested and the best one is chosen. This is impractical when the dataset has a large number of features and instances. For practical learning problems, suboptimal solutions are usually preferred, since an exhaustive search would take too long to compute.
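As a minimal illustration of the feature ranking group, assuming numerical features scored by the absolute Pearson correlation with a binary class label (the scoring metric and data are chosen here for illustration only):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length value lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(X, y, keep=2):
    # score each feature by |correlation| with the class and keep the top ones
    scores = [(abs(pearson([row[j] for row in X], y)), j) for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:keep]]

X = [[1.0, 5.0, 0.1], [2.0, 5.1, 0.9], [3.0, 4.9, 0.2], [4.0, 5.0, 0.8]]
y = [0, 0, 1, 1]
print(rank_features(X, y, keep=1))  # → [0]: feature 0 tracks the class most closely
```

Note that such univariate ranking ignores redundancy between features, which is precisely what the CFS and mRMR methods below address.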
2.3.1 Correlation-based
Hall [21] claims that "feature selection for classification tasks in machine learning can be accomplished on the basis of correlation between features, and that such a feature selection procedure can be beneficial to common machine learning algorithms". The author compares Correlation-based Feature Selection (CFS) with wrapper methods and empirically concludes that CFS generally outperforms them. He also concludes that CFS is faster, often by two orders of magnitude, than wrapper methods.
2.3.2 Minimum redundancy maximum relevance (mRMR)
Peng et al. [43] proposed a feature selection method based on mutual-information techniques.
The authors study how to select good features according to the maximal statistical dependency
criterion based on mutual information. The method effectively combines mRMR with wrapper
methods to achieve a very compact subset of features from candidate features at lower com-
putational expense. The authors conclude experimentally that the use of mRMR based feature
selection can significantly improve the accuracy of the classification models.
2.4 Overcoming class imbalance
Imbalanced learning targets a significant number of problems of interest to academia, industry and government agencies [24]. The main problem of learning from imbalanced data sets is that the imbalance compromises the performance of most standard learning algorithms, since the majority of them assume a balanced class distribution or equal misclassification costs. The typical result is a bias towards the predominant class, which yields poor predictions for the minority class. The challenge is described in [24] as creating a highly accurate model for the minority class without severely damaging the accuracy of the majority class (note that in many cases the minority class is the most relevant one). In these problems, the evaluation of the classifier is not easy, since one class is highly under-represented. The conventional practice of using a single assessment criterion, such as the overall accuracy or the error rate, does not provide an adequate way to evaluate the model on imbalanced datasets. For example, for a data set with a ratio of 1000:1, if every single instance is classified as the majority class, the classification error would still be only 0.1%. More informative assessment metrics, such as ROC curves, precision-recall curves and cost curves, are needed to evaluate the performance of a classifier in the presence of an imbalanced dataset. This problem arises in different domains, such as biomedicine, fraud detection and network intrusion detection. Imbalances can be characterized as intrinsic, when the nature of the data space itself results in an imbalanced class distribution. Variable factors such as time and storage can also originate imbalanced data sets; we call this type of imbalance extrinsic, i.e., the imbalance is not directly related to the nature of the data space, but results from such external factors. It is of great importance to understand the
difference between relative imbalance and imbalance due to rare instances. For example, with a ratio of 100:1 in a dataset with 100,000 instances, 1,000 of those belong to the minority class. If we double the number of instances, the distribution does not change: there are now 2,000 minority instances, so the minority class is not necessarily rare in absolute terms, but merely small in relation to the majority class. This is an example of relative imbalance. The study of He et al. [24] concluded that in certain relatively imbalanced data sets, the minority class is accurately learned with little disturbance from the imbalance. This suggests that the imbalance is not the only factor that hinders learning; data complexity is the primary determining factor of the deterioration, which is then amplified by the relative imbalance. Data complexity is a term that covers issues such as overlapping classes, lack of representative data, small disjuncts and others. Imbalance due to rare instances, in contrast, is representative of domains where the minority class instances are genuinely very limited. In this kind of imbalance the learning process is more complex, due to the lack of representativity of the minority class. Minority concepts may also have sub-concepts with limited instances, which can make the classification difficult. This form of imbalance is called within-class imbalance, and concerns the distribution of the representative data over the sub-concepts inside a class.
2.4.1 Techniques to deal with imbalanced datasets
Class imbalance in the dataset can damage the quality of the classification. If the minority class is hard to discriminate, the classifier may simply classify every instance as the majority class. In that case, if the minority class represents only 1% of the data, the obtained accuracy would be 99%, although the classifier is useless for the discriminative task.
To overcome this problem, a set of methods has been developed with the objective of minimizing the effect of using imbalanced data. Examples of these techniques are: sampling, by removing, creating or duplicating instances; and cost-sensitive methods, where different costs are assigned to the misclassification of the different classes.
2.4.1.A Random Sampling
Two types of random sampling can be defined. Random under-sampling selects a small subset of instances from the majority class. Random over-sampling duplicates instances from the minority class in order to balance the dataset. Although the former can remove important instances from the dataset and the latter can lead to overfitting [9], sampling techniques have been constantly improving to overcome those issues. For example, random under-sampling can be used to remove instances that are far away from the decision border (using kNN to define the decision border) [9].
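A minimal sketch of the two random sampling strategies, assuming the instances are held in plain Python lists:

```python
import random

def random_undersample(majority, minority, seed=0):
    # keep only as many majority instances as there are minority instances
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def random_oversample(majority, minority, seed=0):
    # duplicate randomly chosen minority instances until the classes balance
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

majority = [("maj", i) for i in range(10)]
minority = [("min", i) for i in range(2)]
print(len(random_undersample(majority, minority)))  # → 4
print(len(random_oversample(majority, minority)))   # → 20
```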
2.4.1.B Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE is an over-sampling approach in which the minority class is over-sampled by creating synthetic instances rather than by over-sampling with replacement. The minority class is over-sampled by taking each minority class sample and introducing synthetic instances along the line segments joining any or all of its k nearest minority class neighbours. Depending on the amount of over-sampling needed, neighbours from the k nearest neighbours are chosen. For example, with k = 5 and 200% over-sampling, only two of the five nearest neighbours are chosen, and one instance is created in the direction of each. Synthetic instances are created in the following way: 1) take the difference between the feature vector (instance) under consideration and its nearest neighbour; 2) multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This selects a random point along the line segment between two specific instances, and effectively forces the decision region of the minority class to become more general [10].
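The two SMOTE steps can be sketched as follows, assuming numeric feature vectors in plain lists (a simplified illustration: neighbours are searched among all minority instances, and one synthetic instance is drawn per iteration):

```python
import random

def smote_instance(x, neighbour, rng):
    # steps 1-2: difference times one random number in [0, 1], added to x
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]

def smote(minority, k=2, n_new=3, seed=0):
    # for each synthetic instance: pick a minority point, find its k nearest
    # minority neighbours, then interpolate towards one of them
    rng = random.Random(seed)
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: sqdist(x, m))[:k]
        synthetic.append(smote_instance(x, rng.choice(neighbours), rng))
    return synthetic

minority = [[1.0, 1.0], [1.5, 1.2], [0.8, 1.4]]
new = smote(minority, k=2, n_new=3)
print(len(new))  # → 3 synthetic minority instances
```

Each synthetic point lies on a segment between two real minority instances, which is what makes the minority decision region more general.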
2.4.1.C Cost-sensitive classifiers
Cost-sensitive learning is used when different misclassification errors incur different penalties. In cost-sensitive learning, a cost matrix must be defined, giving a cost to each classification case (true positives, false positives, false negatives and true negatives). Taking the cost matrix into account, the classification model is trained to have the lowest cumulative cost [17]. The cost can reflect, for instance, economic loss, time, or the severity of an illness. Cost-sensitive classification can be applied to imbalanced datasets by assigning a higher cost to misclassifying the minority class. This gives more weight to the minority class and improves its discrimination.
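A sketch of the cost-sensitive decision rule, choosing the class with the lowest expected cost under a hypothetical cost matrix (the probabilities and costs below are made up):

```python
def expected_cost(probs, costs):
    # probs[i]: estimated probability that the instance belongs to class i
    # costs[i][j]: cost of predicting class j when the true class is i
    n = len(probs)
    return [sum(probs[i] * costs[i][j] for i in range(n)) for j in range(n)]

def cost_sensitive_predict(probs, costs):
    # choose the class with the lowest expected misclassification cost
    ec = expected_cost(probs, costs)
    return min(range(len(ec)), key=ec.__getitem__)

# class 0 = majority, class 1 = minority; missing the minority class costs 10x
costs = [[0, 1],
         [10, 0]]
print(cost_sensitive_predict([0.8, 0.2], costs))  # → 1 despite the lower probability
```

With equal misclassification costs the same rule reduces to picking the most probable class.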
2.5 Model Validation
For a specific classification problem, one can ask which method is the best. Using the training set to make this comparison would lead to misleadingly optimistic results, due to an over-specialisation of the learning algorithm to the data. To avoid this problem, the dataset is divided into a training set and a test set, and the accuracy of a model is evaluated by comparing the results on the test set. The confusion matrix is a helpful tool to analyse how well the classifier recognizes the instances of the different classes. A confusion matrix for a two-class problem is shown in Table 2.1.

Table 2.1: Confusion matrix.

                         Predicted C1                      Predicted C2
Actual C1    Number of true positives (TP)     Number of false negatives (FN)
Actual C2    Number of false positives (FP)    Number of true negatives (TN)

True positives are the positive instances that are correctly classified, while true negatives are the negative instances correctly classified. False positives are negative instances incorrectly classified as positive, and false negatives are positive instances incorrectly classified as negative.

In the case of more than two classes, n confusion matrices are generated, one per class, taking that class as positive and its complement as negative. To calculate the quality metrics (accuracy, sensitivity, etc.), a weighted average is used. For example, to calculate the accuracy in a three-class problem (C1, C2, C3), we compute \sum_{i=1}^{3} (accuracy of C_i × number of instances labelled C_i) and divide it by the total number of instances. One may also use a single matrix with more than two classes; in that case, for each metric, we take the target class as positive and the sum of the other classes as its complement.

A technique used to evaluate the quality of the models is cross-validation. Cross-validation is a statistical technique that evaluates how well the model generalizes to an independent dataset; if the model overfits, this technique will reveal it. The basic form is k-fold cross-validation, in which the data is divided into k nearly equal portions and, in each of the k iterations, k − 1 portions are used for training and the remaining one for testing. For example, when k = 10, the data is divided into ten nearly equal portions; one portion is used as the test set and the remainder as the training set. Since every portion is used both as a test set and as part of the training set, the model is trained and tested 10 times, each time with a different test set. The most common choice in data mining is 10-fold cross-validation. A common way to evaluate a classifier on a small dataset is leave-one-out cross-validation, where k is the number of instances in the dataset.
• Accuracy

Accuracy is the ratio between the number of correctly classified instances and the total
number of instances.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2.28)
• Sensitivity (True Positive Rate or Recall) and Specificity (True Negative Rate)

Sensitivity is the proportion of instances classified as class x among all instances which
truly have class x.

Recall = Sensitivity = TP / (TP + FN)    (2.29)

Specificity, the True Negative Rate, is the proportion of instances correctly classified as
not belonging to class x among all instances which are not of class x.

Specificity = TN / (TN + FP)    (2.30)
• Precision

Precision is the probability of obtaining a relevant result among the subset declared as
positive (TP + FP):

Precision = TP / (TP + FP)    (2.31)
• F-Measure

The F-Measure is the harmonic mean of precision (P) and recall (R).

F = 2PR / (P + R)    (2.32)
• Receiver Operating Characteristic (ROC) curves and |TPR − FPR|

ROC curves are a useful tool to compare classifier models. The ROC curve shows the trade-off
between the true positive rate (TPR), or sensitivity, and the false positive rate (FPR) of a
given classifier model. The ROC curve is drawn in an FPR vs TPR space (which we designate the
ROC space). Each point (FPR, TPR) represents the model using a different classification
threshold. Increasing the classification threshold results in fewer false positives (and more
false negatives), corresponding to a right-to-left movement along the curve.
The area under the ROC curve is a common metric used to compare classifiers. For probabilistic
classifiers (those returning a classification probability), it is computed by integrating the
curve, since the probabilities can be used to vary the threshold and draw the curve. For a
non-probabilistic classifier, i.e., a classifier that returns only the predicted class, the
AUC is simply (TPR + TNR)/2, or (Sensitivity + Specificity)/2, commonly named average
accuracy.
The Area Under the Curve (AUC) is also a good metric for imbalanced learning [24], but it is
harder to compute than |TPR − FPR| in probabilistic models. The AUC does not directly account
for the discriminative power of the classifier, since a classifier with AUC = 0 has more
discriminative power than one with AUC = 0.5.
In a ROC space, the line TPR = FPR represents a random, or meaningless, classifier. A
classifier model m can then be represented in this space as m = (1 − Specificity,
Sensitivity). If m is at the point (0, 1), the classifier is perfect, classifying everything
correctly. On the other hand, if m = (1, 0), the classifier is always wrong; in this case, by
inverting the labels we obtain a perfect classifier. In general, we are interested in a
classification model whose point in the ROC space is as far as possible from the random line
(TPR = FPR). This distance gives us the discriminative score of the classifier. Computing the
normalized Euclidean distance from a point to the random line yields the expression
|TPR − FPR|, or |Sensitivity − (1 − Specificity)|. This value has a maximum of 1, representing
perfect discrimination between the classes, and a minimum of 0, representing a random
classifier, i.e., a classifier without discriminative power. This metric values both classes
equally. One advantage of this metric is that it is insensitive to class imbalance, in
opposition to accuracy, which overvalues the majority class. Its advantage over analysing
sensitivity and specificity separately is that it combines both in a single value, which is
simpler to analyse. For this reason, this metric is referred to in the literature as
informedness [45].
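The metrics above, including the |TPR − FPR| discriminative score, can all be computed directly from the four counts of the confusion matrix; a minimal sketch (the function name and dictionary layout are our own):

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the quality metrics of Equations 2.28-2.32 plus |TPR - FPR|
    from the counts of a two-class confusion matrix (Table 2.1)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. 2.28
    sensitivity = tp / (tp + fn)                          # Eq. 2.29 (recall, TPR)
    specificity = tn / (tn + fp)                          # Eq. 2.30 (TNR)
    precision = tp / (tp + fp)                            # Eq. 2.31
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. 2.32
    fpr = 1 - specificity
    informedness = abs(sensitivity - fpr)                 # |TPR - FPR|
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f_measure": f_measure, "informedness": informedness}

# Example counts: 50 positives (45 found) and 50 negatives (40 found).
m = classification_metrics(tp=45, fn=5, fp=10, tn=40)
# m["accuracy"] == 0.85, m["informedness"] ≈ 0.7
```

Note that a trivial majority-class predictor on imbalanced data can reach high accuracy while its informedness is 0, which is why |TPR − FPR| is used in this thesis.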
2.5.1 Overfitting
Overfitting occurs when the model captures too much detail, or noise, from the training set. As
a result, the performance of the model on the training instances increases while on the test set
(data never seen by the model) the performance becomes worse. In the case of decision trees this
problem can be minimized by pruning the tree, which transforms the model into a more general one.
Figure 2.5 shows a typical case where the test set error (dashed line) increases after the
critical point, even though the training set error (solid line) keeps decreasing. At this point,
overfitting may have occurred.
Figure 2.5: Overfitting (error versus model complexity). Training set (solid line) vs test set (dashed line).
Cross-validation
Cross-validation is used to mitigate the overfitting problem. By using k training sets instead of
only one, this technique evaluates the model more thoroughly and detects models that suffer from
overfitting, allowing us to tune the models to avoid it.
2.6 Related Work
The problem addressed in this work, predicting the conversion from MCI to AD, has already been
studied by different researchers, with different approaches and different data, such as
biomarkers, Positron Emission Tomography (PET) scans, among others. A particular case of such
work is the one by Maroco et al. [37], which uses a previous version of the data used in this
thesis.
The work of Clifford Jr. et al. [27] addresses the prediction of time-to-progression from MCI to AD.
It used 218 subjects with an MCI diagnosis and one or more follow-up assessments, identified
from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The subjects underwent lumbar
puncture or PIB PET while carrying an MCI diagnosis. This work used a set of methods to pursue
its objectives (knowing that the most accepted and validated biomarkers in AD fall into two
categories: imaging and CSF chemical analyses): Magnetic Resonance Imaging (MRI) methods to
extract hippocampal volumes and total intracranial volumes; amyloid imaging methods to extract
a global cortical PIB PET retention summary (combining pre-frontal, orbitofrontal, parietal,
temporal, anterior cingulate and posterior cingulate/precuneus values for each patient);
cerebrospinal fluid methods to quantify biomarker concentrations; and, finally, statistical
methods to relate hippocampal volumes, intracranial volumes, Aβ loads and the presence of the
APOE ε4 allele. Among the 218 patients with an MCI diagnosis, 89 progressed to dementia
in an average time of 1.7 years. Important results are that age and education did not differ
significantly between patients that progressed to dementia and patients that did not, although
women showed a higher occurrence in the progression group. Another finding is that the group
that progressed to dementia had a higher proportion of APOE ε4 carriers and slightly worse
scores on the MMSE and Clinical Dementia Rating Scale at baseline, in comparison with
non-progressors. Aβ load and MRI measures were shown to be highly significant for the
progression to AD.
Chapman et al. [8] predicted the conversion from MCI to AD using neuropsychological tests, both
individually and combined in multivariate ways, using essentially two layers of weighting: the
weighting applied by PCA [1] in reorganizing the neuropsychological test measures, via the
correlations among them, into the component structure; and the differential weighting of the
component scores added by the discriminant analysis, in computing the discriminant coefficients
best able to differentiate between the conversion and stable groups. They studied 43 elderly
patients with an MCI diagnosis. The evaluation of these patients was done by physicians and
met current consensus criteria for the amnestic subtype of MCI. Some tests were performed by
memory-disorders physicians to assist with their diagnoses, including the MMSE, a clock face
drawing and the category fluency task. The median time of progression from MCI to a diagnosis
of AD was 19.4 months. In this work, all patients with evidence of stroke, Parkinson's disease,
HIV/AIDS, reversible dementias, patients medicated with certain drugs, and patients with a
score below a certain threshold on the MMSE test were excluded. The neuropsychological test
data were normalized to limit the influence of age, education and gender effects. Of the 43
patients, 14 were subsequently diagnosed with AD (conversion group) and 29 were not (stable
group). For the PCA [1] analysis more participants were added: 78 persons with normal
cognition (control group), 5 with age-associated memory impairment, and 35 MCI patients,
making a total of 216 participants. The PCA was used to develop the component structure from
the neuropsychological test battery. The main advantages of using PCA were organizing similar
test measures into components and reducing the number of variables while maintaining the
contributions of all measures. The statistical procedures (from the SAS software) used were
MULTTEST, FACTOR, STEPDISC and DISCRIM. Relevant results were that, in general, the conversion
group performed worse on the neuropsychological tests than the stable group, mainly in
retentive memory measures. The PCA on the neuropsychological measures yielded 13 components
shown to have a highly discriminatory power in separating AD from the normal ageing process.
These 13 components included a General Episodic Memory component, a Generative Fluency
component, a Speeded Executive Function component, a Mood/Activities of Daily Living component,
and other components representative of learning and recognition memory. From the 13 components
obtained with PCA, only the 11 most important were used, to maintain a 4:1 ratio between the 43
participants and the predictor variables. The 11 components accounted for 72% of the total
variance of the data. The PCA component scores were used to predict the conversion to AD. From
these 11 components, six were selected in the stepwise discriminant procedure for having the
best discriminatory power: Episodic Memory, Speeded Executive Function, Recognition Memory
(False Positives), Recognition Memory (True Positives), Speed in Visuospatial Memory and
Visuospatial Episodic Memory. The results of the prediction: 36/43 patients were correctly
classified (accuracy of 83.7%); of the 14 in the conversion group, 2 were incorrectly predicted
to have remained stable (sensitivity of 86%); of the stable group, 24/29 patients were
correctly predicted (specificity of 83%).
Ewers et al. [19] addressed the problem of predicting conversion from MCI to AD dementia based
on biomarkers and neuropsychological test performance. The data used consisted of MRI, CSF and
neuropsychological tests. As inclusion criteria they used the scores of neuropsychological
tests such as the MMSE, ADAS and Rey Auditory Verbal Learning Test; the concrete values are
given in [19]. From the CSF, the concentrations of Aβ1−42, t-tau and p-tau181 were extracted.
From the MRI, measures of hippocampus volume and entorhinal cortex were obtained. From the
neuropsychological test scores, the following set of tests was used: Rey Auditory Verbal
Learning Test (RAVLT), tests of frontal lobe functions, Trail Making Test A and B, verbal
fluency tests, Boston Naming Test and Digit Symbol Substitution Test. Statistical methods were
used to select some of the data. All variables were examined for normal distribution; some,
like some of the CSF concentration measures, were log-transformed to achieve a normal
distribution. The technique used was logistic regression analysis, to establish a prediction
model for differentiating AD from aged healthy controls. They created biomarker-only models and
tested whether the neuropsychological variables contributed to the predictive power of the
biomarker-based model. In the MCI group, time to conversion to AD was tested using Cox
regression analysis [D'Agostino et al.]. Within the MCI group, 58/130 developed AD within 3.3
years of clinical follow-up, with a mean follow-up interval of 2.3 years. This work created
several models: a model with biomarkers and neuropsychological variables combined, with 94.5%
classification accuracy, 93.8% sensitivity and 95.6% specificity; and a model with the LRTAA
formula (a logistic regression model based upon the CSF concentrations of t-tau and Aβ1−42 and
the number of APOE ε alleles) combined with neuropsychological tests, with an accuracy of
95.2%, sensitivity of 92.2% and specificity of 97.5%.
Hinrichs et al. [26] addressed the problem of predicting the progression from MCI to AD. The
data used in this work was taken from ADNI and consists of 233 subjects (48 AD, 66 healthy
controls and 119 MCI). The characteristics of the data are: age at baseline, gender, APOE
carrier status, MMSE at baseline, MMSE at 24 months, ADAS at baseline, years of education,
geriatric depression, and MR and FDG-PET images. In some cases other biological and
neurological data were available. The model used to analyse the data is based on the
multi-kernel learning (MKL) framework, which allows the addition of an arbitrary number of
views of the data in a maximum margin setting. The major innovation of MKL is that it learns an
optimal combination of kernel matrices while at the same time training a classifier. The
results obtained with 2-norm MKL are: using imaging modalities, 87.6% accuracy, 78.9%
sensitivity and 93.8% specificity; using biological measures, 70.4% accuracy, 58.1% sensitivity
and 79.4% specificity; using cognitive scores, 91.2% accuracy, 89.2% sensitivity and 92.6%
specificity; using all data combined, 92.4% accuracy, 86.7% sensitivity and 96.6% specificity.
Combining all data achieves the best classification, but using only the cognitive scores yields
results very close to those of all data combined.
Maroco et al. [37] is the most important related work, since it used an older version of the
dataset used in this thesis. The problem this work dealt with is the prediction of evolution
from MCI to AD. The data consist of a group of 921 elderly non-demented patients with cognitive
complaints. The inclusion criterion was a diagnosis of MCI and presence of one or more
follow-up neuropsychological assessments or clinical re-evaluations. The exclusion criteria
were: dementia or other disorders that cause cognitive impairment, medical treatments
interfering with cognitive function, and alcohol or illicit drug abuse. In each follow-up the
patient was classified as MCI or AD. The final dataset is composed of 400 patients. The
neuropsychological predictors used are a subset of tests from the Bateria de Lisboa para
Avaliacao de Demencia (Lisbon Test Battery for Dementia Evaluation) with a criterion validity
of p < 0.1. This study used 10 classifiers: LDA (Fisher's Linear Discriminant Analysis) [39],
QDA (Quadratic Discriminant Analysis) [39], LR (Logistic Regression) [41], MLP (Multilayer
Perceptron), SVM (Support Vector Machines), RBF (Radial Basis Functions), CART (Classification
and Regression Trees) [5], CHAID (Chi-squared Automatic Interaction Detector) [29], QUEST
(Quick Unbiased Efficient Statistical Tree) [34], and RF (Random Forests) [4]. All classifiers
performed better than chance in predicting the conversion to dementia of patients with MCI.
There was no statistical difference in total accuracy between 8 of the 10 classifiers, but RF
(mean = 0.74) and SVM (mean = 0.76) performed significantly better. However, these obtained a
poor performance in terms of sensitivity. Given this poor sensitivity, and the good sensitivity
required in this type of problem (conversion to dementia), LR, neural networks, SVM and CHAID
trees are inappropriate for this binary classification task. Taking accuracy, specificity and
sensitivity into consideration, Fisher's Linear Discriminant Analysis does not rank much lower
than computationally intensive methods like MLP or RF. In conclusion, this work shows that for
this problem RF and LDA have higher accuracy, sensitivity, specificity and discriminant power,
as opposed to SVM, neural networks and classification trees.
Table 2.2 synthesizes the related work referred to in this chapter, together with information
about the tests performed.
Table 2.2: Synthesis of the related work.

Work                 Problem                             Data                                              Methods
Clifford Jr. et al   Time to progression                 MRI, PIB PET and neuropsychological tests         Statistical methods
Ewers et al          Time to progression and prognosis   Neuropsychological tests and biomarkers           Statistical methods, logistic regression analysis
Chapman et al        Prognosis                           Neuropsychological tests                          PCA, statistical methods
Hinrichs et al       Prognosis                           Neuropsychological tests, FDG-PET and MR images   MKL (Multi-Kernel Learning)
Maroco et al         Prognosis                           Neuropsychological tests                          Fisher's Linear Discriminant, Quadratic Discriminant Analysis, Logistic Regression, MLP, SVM, Radial Basis Functions, CART, CHAID, QUEST and Random Forests
2.7 Summary
In this chapter we described Alzheimer's Disease, the tests used to evaluate a patient, in
particular the neuropsychological tests, and we analysed the related work. These tests are
organized in batteries that assess and monitor the patient's mental health. The
neuropsychological tests have several advantages over, for example, biomarkers, since they are
cheaper, more widely applied and non-invasive. But their use also has disadvantages, like the
human factor in evaluating subjective tests and the lack of quality assurance of the data.
These problems demand extra care in the analysis of the results, since this data, with its
great number of missing values and its imbalance, has a tendency to overfit the models. In this
chapter a variety of methods to train and test the models was also described.
3. Differentiating MCI from AD (Diagnosis)
Correctly differentiating MCI from AD is a key step in the process of predicting the conversion
from MCI to AD. In this context, this chapter addresses the diagnosis problem. Two feature
selection techniques are used, techniques for overcoming missing values are studied, and the
methodology described in this chapter is applied.
3.1 Formulation and Methodology
In this section the problem formulation and methodology are defined. By problem formulation we
mean the formulation of the problems that we aim to solve in this work, and by methodology the
methods that we use to solve the formulated problems.
3.1.1 Data Description
The Cognitive Complaints Cohort [35, 37] is a prospective study conducted at the Institute of
Molecular Medicine (IMM), Lisbon, to investigate the cognitive stability or evolution to dementia of
subjects with cognitive complaints based on a comprehensive neuropsychological evaluation and
other biomarkers. The criteria for inclusion, exclusion, and diagnosis of the participants as MCI or
AD during follow-up are described in detail in Dina et al. [51]. In this work, we used a revised and
augmented version of this dataset and considered only neuropsychological data.
The original dataset (Table 3.1) has 1641 instances consisting of individual evaluations of 950
distinct patients during their follow-up at IMM. In each evaluation, each patient was classified
by the medical doctors as Normal, preMCI, MCI or AD using clinical criteria. Only instances
labelled as MCI and AD were considered, and from these, only those concerning patients with at
least two evaluations were analysed, since instances corresponding to patients without follow-up
are more likely to be misclassified 1. All instances with a percentage of missing values of at
least 90% were removed, since these instances carry little information. This yielded a dataset
with 677 instances labelled as either MCI or AD, where each instance corresponds to a different
evaluation of one of 337 distinct patients. We note that, since we aim to distinguish between
MCI and AD patients (or, in the prognosis, the evolution from MCI to AD), we can consider each
evaluation of a patient as a different instance, meaning that we can learn from patients at
different disease stages: patients always diagnosed as MCI during follow-up and patients that
convert to AD during follow-up.
After excluding non-informative features, such as "Patient ID", and features related to the
patient's clinical history, such as "Follow-up Time", the analysed dataset (details in Table
3.2) is composed of 677 instances described by 157 features/attributes, which can be numerical,
categorical/nominal or ordinal. This dataset is highly imbalanced in the original classes,
since approximately 86% of the instances are labelled as MCI. Moreover, missing values, which
amount to around 50% of the overall data, are still an issue, as we discuss below. Figure 3.1
presents the histogram of missing values per feature, where it can be observed that 60% of all
features have more than 40% missing values.
Table 3.1: Original dataset details.

                          Normal        Pre-MCI     MCI            AD
Group size (%)            280 (17.1%)   63 (3.8%)   1147 (69.9%)   151 (9.2%)
Age (M±SD)                64.6±1.1      64.9±10     70.1±10.7      73.6±13.1
Sex (Female/Male)         201/78        32/31       679/468        98/52
Schooling years (M±SD)    10±0.2        11.2±5      8.5±4.9        8.7±5.2
Table 3.2: Dataset details after pre-processing.

                          MCI           AD
Group size (%)            583 (86.1%)   94 (13.9%)
Age (M±SD)                70 ± 8.4      73.3 ± 8.2
Sex (Female/Male)         352/231       61/33
Schooling years (M±SD)    8.7 ± 4.9     8.5 ± 5.2
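The pre-processing steps that reduce the original evaluations to the final dataset can be sketched as follows (the record layout and the tiny example cohort are illustrative assumptions; the real data has 157 features):

```python
from collections import Counter

def preprocess(evaluations, missing_threshold=0.9):
    """Keep only MCI/AD evaluations of patients with at least two
    evaluations, and drop instances with >= 90% missing values (None)."""
    # 1. Keep only instances labelled MCI or AD.
    kept = [e for e in evaluations if e["label"] in ("MCI", "AD")]
    # 2. Keep only patients with at least two evaluations (follow-up).
    counts = Counter(e["patient_id"] for e in kept)
    kept = [e for e in kept if counts[e["patient_id"]] >= 2]
    # 3. Remove instances whose fraction of missing values reaches the threshold.
    def missing_fraction(e):
        feats = e["features"]
        return sum(1 for v in feats if v is None) / len(feats)
    return [e for e in kept if missing_fraction(e) < missing_threshold]

# Tiny illustrative cohort (3 hypothetical features per evaluation).
evals = [
    {"patient_id": 1, "label": "MCI", "features": [3.0, None, 2.0]},
    {"patient_id": 1, "label": "AD", "features": [1.0, 1.5, None]},
    {"patient_id": 2, "label": "MCI", "features": [2.0, 2.0, 2.0]},      # no follow-up
    {"patient_id": 3, "label": "Normal", "features": [4.0, 4.0, 4.0]},   # wrong label
    {"patient_id": 4, "label": "MCI", "features": [None, None, None]},   # all missing
    {"patient_id": 4, "label": "AD", "features": [1.0, 1.0, 1.0]},
]
result = preprocess(evals)  # keeps patient 1 (both) and patient 4's second evaluation
```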
3.1.2 Problems
With the previously discussed data we can formulate three problems: the diagnosis, the
prognosis, and the time to conversion. Of these, only the time to conversion is not discussed in
1 It is less likely for patients with follow-up instances near the classification frontier to be misclassified with the wrong label.
Figure 3.1: Histogram of missing values per feature in the original dataset.
this work. The diagnosis is the problem of identifying the disease stage; in our case we aim at
differentiating MCI from AD patients, with the MCI class considered the positive class. This
differentiation is done using neuropsychological data obtained in clinical evaluations,
together with the medical appreciation of the patient at that specific point in time. The
relevance of this problem lies in helping the medical doctors identify the disease stage. It
plays an important role, since the diagnosis is the first step of the medical evaluation of the
patient, and a correct diagnosis helps the medical doctor to rapidly adjust the patient's care,
increasing the patient's quality of life.
The prognosis consists in predicting whether a patient will evolve from MCI to AD. For this we
use neuropsychological tests, performed by medical doctors in clinical evaluations; in this
problem the patient must have at least one follow-up. Determining whether a patient will evolve
to AD is of great importance to the medical doctors, since they can then apply preventive
treatment to grant the patient more quality of life and even increase life expectancy. This
problem, which is discussed in Chapter 5, is more difficult than the diagnosis, since we want
to predict the future evolution of the patient, and each patient is a unique human being with
different responses to the disease.
3.1.3 Methodology
A single data mining methodology can be used for all problems. This methodology includes six
classifiers: Naïve Bayes, Gaussian SVM, polynomial SVM, k-nearest neighbours, C4.5 decision
trees and artificial neural networks trained with backpropagation. All classifiers used are
implemented in WEKA.
The imbalance of the data is tackled with a synthetic oversampling technique (SMOTE) [10]. The
classifier parameters and the percentage of oversampling are determined using 10-fold
cross-validation in a grid search approach. The percentage of oversampling and the parameters
are searched jointly, since SMOTE changes the dataset, and thus the parameters found with
different SMOTE percentages may not be the same. The SMOTE algorithm is implemented in WEKA.
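The core idea of SMOTE, creating synthetic minority instances by interpolating between a minority example and one of its nearest minority neighbours, can be sketched in plain Python (this is a simplified version of the algorithm in [10], not the WEKA implementation used here):

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic new minority instances. Each one lies on the
    segment between a real minority instance and one of its k nearest
    minority neighbours (Euclidean distance), at a random position."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not base),
                            key=lambda m: dist2(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Oversample a toy 2-feature minority class by 200% (two synthetic per real).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=8)
```

Because the synthetic points are convex combinations of real minority instances, they stay inside the region occupied by that class rather than being mere duplicates.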
The high number of features, many of them highly correlated, is tackled using feature
selection. Feature selection reduces the effect of the curse of dimensionality, increases the
discriminative power, improves model generalization and reduces the effect of missing data. For
feature selection we applied two techniques: correlation-based feature selection [21] and mRMR
(minimum-redundancy-maximum-relevance) [43]. The correlation-based feature selection is
implemented in WEKA; mRMR was implemented in Matlab by a team member of the NEUROCLINOMICS
project.
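The greedy mRMR idea, repeatedly picking the feature with maximum relevance to the class minus mean redundancy with the already selected features, can be sketched as follows; for simplicity, absolute Pearson correlation stands in here for the mutual information used in [43], and the toy data is our own:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def mrmr_select(features, target, n_select):
    """features: dict name -> list of values; target: list of class values.
    Greedily select n_select features maximizing relevance minus redundancy."""
    relevance = {f: abs(pearson(v, target)) for f, v in features.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant
    while len(selected) < n_select:
        def score(f):
            redundancy = sum(abs(pearson(features[f], features[s]))
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        candidates = [f for f in features if f not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy example: f1 predicts the class, f2 duplicates f1 (redundant), f3 is noise.
target = [0, 0, 0, 1, 1, 1]
features = {
    "f1": [1.0, 1.1, 0.9, 3.0, 3.2, 2.9],
    "f2": [1.0, 1.1, 0.9, 3.0, 3.2, 2.9],   # identical to f1
    "f3": [5.0, 1.0, 4.0, 2.0, 5.0, 1.0],
}
chosen = mrmr_select(features, target, n_select=2)
# → ["f1", "f3"]: the duplicate f2 is skipped because of its redundancy
```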
Missing data is a significant problem in this dataset, since approximately 50% of the values
are missing. Its effect is reduced by the feature selection, since it removes highly incomplete
features, but the problem is also dealt with inside the classifiers in different ways. Some
methods have an internal way of handling missing data: the SVMs use median/mode imputation,
decision trees like the C4.5 algorithm use statistical methods that minimize the effect of
missing data, neural networks turn off the input neuron in case of a missing value, and kNN
assumes the maximum possible distance when values are missing.
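The kNN strategy mentioned above, assuming the maximum possible distance whenever a value is missing, can be sketched as a modified distance function (a simplification of that behaviour, assuming feature values normalized to [0, 1] so that the maximum per-feature difference is 1):

```python
def distance_with_missing(a, b):
    """Euclidean distance between two instances whose features are
    normalized to [0, 1]; None marks a missing value. If either value
    is missing, the worst-case difference of 1 is assumed."""
    total = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            d = 1.0  # maximum possible per-feature difference
        else:
            d = abs(x - y)
        total += d * d
    return total ** 0.5

# Identical instances, except one has a missing second feature:
print(distance_with_missing([0.2, 0.5, 0.9], [0.2, None, 0.9]))  # 1.0
```

This conservative choice means a missing value can never make two instances look artificially close.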
Model
Each classifier has a parameter or a set of parameters. The best SMOTE percentage found can
change with the problem, the subset of features, the classifier parameters or the classifier
itself. The best classifier parameters and SMOTE percentage must therefore be determined
jointly. Thus we cross all tested SMOTE percentages with all tested parameter sets in a grid
search. This search is done for each feature selection method, with a systematic process to
evaluate the parameter sets and SMOTE percentages over a defined space. An automated tool was
created to deal with this necessity, avoiding mistakes and optimizing the grid search process.
The metric used to compare the models for each parameter set and SMOTE percentage is
|TPR − FPR|, obtained by computing the normalized Euclidean distance from the point (FPR, TPR)
to the random line (FPR = TPR).
The grid search is performed for all classifier models and all datasets (with and without
feature selection); the parameter intervals are shown in Table 3.3. The tested SMOTE
percentages consist of 11 steps, from 0% (no oversampling) to the inversion of the imbalance.
All parameter sets and SMOTE percentages are tested with 10-fold cross-validation. After the
search, six best triples {Classifier, <Parameters>, SMOTE percentage} are found, one for each
classifier on a specific dataset with a feature selection method. Each triple is then tested
again in 30 repetitions, using a different seed for the 10-fold cross-validation in each
repetition. This allows us to perform a statistical analysis of the results. This parametrized
model is used to find the best parameter set and SMOTE percentage for all classifiers in each
dataset, with and without feature selection.
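The joint search amounts to iterating over the Cartesian product of parameter sets and SMOTE percentages and keeping the triple with the best |TPR − FPR|. A schematic sketch, where the evaluation function is a deterministic stub standing in for a real 10-fold cross-validation run (its formulas and the parameter names are illustrative only):

```python
from itertools import product

def grid_search(classifier, param_grid, smote_percentages, evaluate):
    """Return the triple (classifier, parameters, SMOTE %) maximizing
    |TPR - FPR|, crossing every parameter combination with every
    SMOTE percentage. `evaluate` must return (tpr, fpr)."""
    names = sorted(param_grid)
    best, best_score = None, -1.0
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        for pct in smote_percentages:
            tpr, fpr = evaluate(classifier, params, pct)
            score = abs(tpr - fpr)
            if score > best_score:
                best, best_score = (classifier, params, pct), score
    return best, best_score

# Stub evaluation: pretend TPR improves with SMOTE and FPR with C (toy values).
def fake_cv_evaluate(classifier, params, smote_pct):
    tpr = min(1.0, 0.5 + smote_pct / 1000)
    fpr = max(0.0, 0.4 - params["C"] / 50)
    return tpr, fpr

best, score = grid_search("RBF SVM",
                          {"C": [1, 5, 10], "gamma": [0.01, 0.1]},
                          smote_percentages=[0, 100, 200],
                          evaluate=fake_cv_evaluate)
# best == ("RBF SVM", {"C": 10, "gamma": 0.01}, 200)
```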
The SMOTE technique is applied only inside the cross-validation and only to the training set.
If this supervised technique were applied outside the cross-validation, or to the full dataset,
it would produce over-optimistic results: we would be testing the model on synthetic instances
created from the training set. The feature selection is performed outside the cross-validation.
This is done only because of the need to provide the medical doctors with the selected set of
features, and because using this technique inside the cross-validation would give us ten
feature sets. Since the feature selection is done outside the cross-validation, results that
use different feature selection methods are not directly comparable, because of the bias
introduced by applying a supervised method to all the data. Making a direct comparison would be
misleading, and would not represent the response of the model to new instances. To tackle this
problem, a final testing model was created, which is described below.
Table 3.3: Grid search parameter intervals.

Classifier     Parameters
Naïve Bayes    Gaussian, supervised discretization or kernel estimation
RBF SVM        Complexity ∈ [1, 10] and γ ∈ [10^-5, 10^2]
Poly SVM       Complexity ∈ [1, 10] and Degree ∈ [0.5, 5.0]
C4.5 DT        Confidence ∈ [0.05, 0.5]
ANN            Time ∈ [1000, 2000], Learning rate ∈ [0.1, 0.4] and Momentum ∈ [0.1, 0.3]
kNN            k ∈ [1, 10]
Figure 3.2: Data flow used in the parameter grid search for finding the classifier parameters: the data goes through feature selection, 10 non-overlapping folds are generated for cross-validation, SMOTE synthetic oversampling is applied to the training set before model training, and the testing set is used for model evaluation, producing the results. The SMOTE percentage is tested with 11 different values for each parameter combination.
Testing Model
For testing the obtained classification models a different dataset is used, obtained by
splitting the original dataset into 75% of the patients for training and 25% of the patients
for testing. For this we apply stratification based on: (i) number of evaluations; (ii) age;
(iii) sex; (iv) schooling years; and (v) class. The splitting of patients was therefore made
such that the distribution of the above variables is kept approximately constant in the
training and testing datasets.
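A simplified version of this patient-level split, stratifying here by class only whereas the thesis also stratifies by number of evaluations, age, sex and schooling years, could look like this:

```python
import random
from collections import defaultdict

def split_patients(patient_labels, train_fraction=0.75, seed=0):
    """Split patients (not individual evaluations) into train/test sets,
    keeping the class distribution approximately constant in both.
    patient_labels: dict patient_id -> class label."""
    by_class = defaultdict(list)
    for pid, label in patient_labels.items():
        by_class[label].append(pid)
    rng = random.Random(seed)
    train, test = [], []
    for label, pids in by_class.items():
        rng.shuffle(pids)
        cut = round(len(pids) * train_fraction)
        train.extend(pids[:cut])
        test.extend(pids[cut:])
    return train, test

# 8 MCI patients and 4 AD patients -> 6/2 MCI and 3/1 AD after the split.
labels = {i: "MCI" for i in range(8)}
labels.update({i: "AD" for i in range(8, 12)})
train, test = split_patients(labels)
```

Splitting by patient, rather than by evaluation, is what guarantees that no evaluation of a test patient ever reaches the training set.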
This allows the test set to be used in all problems: diagnosis, prognosis and any future
problem tackled in the NEUROCLINOMICS project. It should be noticed that the test set will not be used
Figure 3.3: Data flow used to simulate the real-world results: the training set goes through SMOTE synthetic oversampling and, with the found parameters, trains the classifier (model), which is then evaluated on the test set to produce the results.
to find the best parameter set. It is only used to evaluate the final models created using the
training set. These models use only the training set to find the best features and the best
parameters for that specific dataset. This allows us to analyse the behaviour of the trained
models in a "real world" simulation, since the model has never been in contact with any
instance of the test patients. Note also that the features and the parameters are selected
using only 75% of the data, avoiding overfitting in the feature selection and parameter grid
search. Such overfitting would compromise the generalization of the model.
Table 3.4: Details of the train set.

                          Normal        Pre-MCI      MCI           AD
Group size (%)            203 (16.6%)   42 (3.4%)    856 (69.8%)   125 (10.2%)
Age (M±SD)                65±0.9        63.8±10.1    69.9±10.1     73.4±14
Sex (Female/Male)         149/53        21/21        490/366       79/45
Schooling years (M±SD)    10.1±0.2      12.5±4.6     8.7±4.9       8.9±5.3

Table 3.5: Details of the test set.

                          Normal        Pre-MCI      MCI           DEM
Group size (%)            77 (18.6%)    21 (5.1%)    291 (70.1%)   26 (6.3%)
Age (M±SD)                63.5±0.6      67.1±9.7     70.8±12.4     74.7±6.8
Sex (Female/Male)         52/25         11/10        189/102       19/7
Schooling years (M±SD)    9.7±0.1       8.5±4.9      8±4.8         8±5
Tables 3.4 and 3.5 detail the obtained train and test sets. The stratification took into
consideration the number of evaluations of each patient in the raw dataset. As can be concluded
from the tables, the distribution of instances in the two sets is very similar.
3.2 Missing values
In this section, the aim is to study the missing values in a more systematic way, to find out
how they impact the classification results. For this we test a variety of strategies to deal
with missing values: median/mode imputation; median/mode imputation using only the patient's
previous instances; linear regression over the patient's evolution for the imputation of
missing values; and the use of a single value to denote a missing value.
Missing Minimization We use two strategies to reduce the number of missing values: in the first, we impute a value using the average of that feature over the patient's other evaluations; in the second, we use linear regression over those evaluations to determine a value. These strategies will not remove every missing value, but reduce their number significantly.
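The two minimization strategies above can be sketched as follows, for a single feature of a single patient; the function name and the representation of a missing value as `None` are illustrative assumptions, not the thesis implementation:

```python
def impute_from_other_evaluations(values, times=None):
    """Fill missing entries (None) in one patient's series of scores for a
    single feature, using the patient's other evaluations.

    Strategy 1: replace a missing value by the mean of the observed ones.
    Strategy 2 (if `times` is given): fit a least-squares line through the
    observed (time, value) pairs and evaluate it at the missing time.
    Entries stay None when the patient has no observed value at all."""
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    if not observed:
        return list(values)

    if times is None or len(observed) < 2:
        mean = sum(v for _, v in observed) / len(observed)
        return [mean if v is None else v for v in values]

    # Least-squares fit v = a*t + b over the observed evaluations.
    ts = [times[i] for i, _ in observed]
    vs = [v for _, v in observed]
    n = len(ts)
    t_mean, v_mean = sum(ts) / n, sum(vs) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    a = 0.0 if denom == 0 else sum((t - t_mean) * (v - v_mean)
                                   for t, v in zip(ts, vs)) / denom
    b = v_mean - a * t_mean
    return [a * times[i] + b if v is None else v
            for i, v in enumerate(values)]
```

For example, a patient scoring 10 and 14 at evaluations 0 and 2 gets a value of 12 imputed for a missing evaluation at time 1 under the linear-regression strategy.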
Random Assumption In the majority of the work done on missing value analysis [12], the assumption of random occurrence is made. In our data we know that this assumption is probably in part fallacious: a doctor may skip a test if the patient has a low score in another test, if the patient is simply too tired, because of time restrictions, and so on. Nevertheless, the main techniques were tested in order to observe whether this assumption can improve the overall classification over a set of classifiers.
Non-random Assumption Now we use the assumption that the missing values do not appear at random. In fact, the existence of a missing value may itself have discriminative power. The techniques to minimize missing values are not used, since the assumption is now that the missing values are purposeful. For this study all features become nominal, and some experiments use a discretized dataset created with a supervised algorithm [12]. To study this assumption, we replace each missing value with the value "MISSING", discretize the data, and then analyse whether there is some improvement in discriminative power.
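The non-random treatment can be sketched as follows; note that, as a simplification, unsupervised equal-frequency binning stands in for the supervised discretization algorithm actually used [12]:

```python
def to_nominal(values, n_bins=3, missing_token="MISSING"):
    """Turn a numerical feature into a nominal one, keeping missing values
    (None) as an explicit "MISSING" category.

    Equal-frequency binning is used here as a stand-in; the thesis uses a
    supervised discretization algorithm instead."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        return [missing_token] * len(values)
    # Equal-frequency cut points over the observed values.
    cuts = [observed[(k * len(observed)) // n_bins] for k in range(1, n_bins)]

    def bin_of(v):
        if v is None:
            return missing_token
        return "bin%d" % sum(v >= c for c in cuts)

    return [bin_of(v) for v in values]
```

The missing value thus becomes an ordinary category that the classifier can exploit, which is exactly the point of the non-random assumption.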
3.2.1 Experimental Setup
Using the knowledge acquired previously, such as the best feature selection technique and the oversampling percentage of the minority class (SMOTE), this setup uses four classification techniques: a linear SVM, an RBF SVM, a C4.5 decision tree and Naïve Bayes.
The configuration is:
• C4.5 Decision Tree [46] with a 0.25 confidence factor.
• Naïve Bayes classifier [28], assuming that each feature's probability density function (pdf) is Gaussian.
• Support Vector Machine (SVM) [30] using either a linear or a radial basis function (RBF) kernel (γ = 0.001). The complexity parameter, which defines the maximum weight of the support vectors, is C = 1 for the linear kernel and C = 2 for the RBF kernel.
As feature selection, the correlation-based subset method [21] is used. In the random assumption test, we combine the datasets resulting from using the average or the linear regression over the patient's evaluations with two techniques to remove the remaining missing values. These techniques, implemented in WEKA, are the replacement of missing values with the median or mode, and Expectation Maximization imputation. In the non-random assumption, the missing values were replaced by the string "MISSING". The numerical features are now nominal; for this we also used a supervised discretization algorithm to find the best discretization.
3.2.2 Results
Figure 3.4: The influence of replacing missing values with different techniques under the random assumption, using the metric |TPR − FPR|. The classification algorithms are NB (Naïve Bayes), SVM Linear, SVM RBF (Gaussian) and a C4.5 DT (decision tree). The imputation techniques are median and mode (for numerical and categorical features, respectively) and Expectation Maximization (EM). The missing minimization techniques are AVG (average) and LR (linear regression), both computed using the patient's other evaluations.
A comparative study was made to assess the influence of this assumption when replacing the missing values. In Figures 3.4 and 3.5, the comparison uses the |TPR − FPR| metric, since it reflects the trade-off between sensitivity and specificity. By analysing the results, the best technique to deal with the missing values, as expected, depends on the classification algorithm.
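The metric can be computed directly from a confusion matrix; a minimal sketch:

```python
def tpr_fpr_gap(tp, fn, fp, tn):
    """|TPR - FPR|: the absolute gap between the true-positive rate
    (sensitivity) and the false-positive rate (1 - specificity).
    A random classifier scores about 0 and a perfect one scores 1,
    so the metric stays meaningful on imbalanced data."""
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return abs(tpr - fpr)
```

For instance, a classifier with TPR = 0.8 and FPR = 0.1 scores 0.7, while one that labels everything positive (TPR = FPR = 1) scores 0, which is what makes the metric robust to class imbalance.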
The best results for each method are:
• In Naïve Bayes, with the non-random assumption, the discretized dataset shows a slightly better result than the original dataset; note that this dataset still contains the missing values. All datasets, however, yield a satisfactory result.
• In the Linear SVM, with the random assumption, using the original dataset also yields the best result. In this case, using linear regression to minimize the missing values gives a similar result.

Figure 3.5: The influence of replacing missing values with different techniques under the non-random assumption, using the metric |TPR − FPR|. The classification algorithms are NB (Naïve Bayes), SVM Linear, SVM RBF (Gaussian) and a C4.5 DT (decision tree), using a unique value to represent missing values and also a discretized dataset.
• In the RBF SVM, with the random assumption, the use of linear regression with median/mode, or of the average with median/mode, is marginally better than using the original dataset.
• In the C4.5 decision tree, the effects of dealing with the missing values are more evident. The best technique, under the random assumption, is the use of median/mode imputation to replace all missing values. Note that for this classifier the datasets without missing minimization are by far the best.
• For kNN and the Neural Networks the results are not shown, but in both cases the random assumption with median/mode imputation gives the best results.
3.2.3 Conclusions
The techniques that increased the discriminative power were those used under the randomness assumption. This does not imply that the missing values are random, only that, in the performed experiments, the techniques based on the randomness assumption gave better results. Perhaps by using other techniques, and by feeding neuropsychological domain information into the classifiers, the non-randomness assumption would produce the expected results. Nevertheless, the gain in discriminative power is not significant: the results obtained using each classifier's default way of dealing with missing values are very similar and in some cases even better.
3.3 Results and Discussion
For the diagnosis problem, six triples (classifier, parameter set, SMOTE percentage) have been selected for each feature set. The triples are shown in Table 3.6. These results were obtained with a grid search using only the training set. The box plots with the classification results on the training set are shown in Figure 3.6.
Figure 3.6: Train results of the diagnosis for the 3 feature sets. Each panel shows box plots of |TPR − FPR| for one classifier: kNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF and SVM Poly.
Additionally, for each classifier the following method is used to deal with missing values:

• In Naïve Bayes, the internal mechanism of ignoring the missing data is used.

• In the SVMs, the internal median/mode imputation is used.

• In kNN, median/mode imputation is used; the internal mechanism of assuming the maximal possible distance in missing cases is never used.

• In the C4.5 decision tree, median/mode imputation is used.

• In the Neural Networks, median/mode imputation is used.
Table 3.6 presents the best set of parameters found by the grid search. It should be noticed that, to balance the two classes, a synthetic oversampling of about 600% would have to be applied.
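SMOTE itself is applied via WEKA in this work; the underlying idea, interpolating between a minority instance and one of its nearest minority neighbours, can be sketched in a few lines (the function name and parameters are illustrative):

```python
import random

def smote(minority, percentage, k=5, seed=0):
    """Minimal SMOTE sketch: for each required synthetic sample, pick a
    minority instance, pick one of its k nearest minority neighbours, and
    interpolate at a random point on the segment between them.

    `minority` is a list of numeric feature vectors; `percentage` follows
    the SMOTE convention (e.g. 600 -> six synthetic samples per instance)."""
    rng = random.Random(seed)
    n_new = int(len(minority) * percentage / 100)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: dist2(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation point in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Because each synthetic point lies on a segment between two real minority instances, the oversampled class stays inside its original region of the feature space rather than being duplicated verbatim.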
Naïve Bayes

Naïve Bayes is, by design, insensitive to class imbalance, since the probabilities of belonging
Table 3.6: Diagnosis parameters, obtained using the training set. The number of features selected is 153 for All Features, 32 for Correlation and 22 for mRMR. The percentage of missing values is 45% for All Features, 13% for Correlation and 29% for mRMR.

Classifier          Feature Selection  SMOTE   Parameters
Naïve Bayes         All Features       0%      Supervised Discretization
                    Correlation        0%      Kernel
                    mRMR               1270%   Kernel
SVM RBF             All Features       635%    Compl = 2.5 and γ = 0.01
                    Correlation        381%    Compl = 4.0 and γ = 0.01
                    mRMR               508%    Compl = 2.0 and γ = 0.01
SVM Poly            All Features       1143%   Compl = 1.5 and Exp = 1
                    Correlation        1270%   Compl = 0.5 and Exp = 4
                    mRMR               1143%   Compl = 0.5 and Exp = 3
Neural Network      All Features       0%      l = 0.3, m = 0.2 and time = 2000
                    Correlation        0%      l = 0.3, m = 0.1 and time = 2000
                    mRMR               0%      l = 0.3, m = 0.1 and time = 1000
Decision Tree C4.5  All Features       508%    Conf = 0.05
                    Correlation        1143%   Conf = 0.05
                    mRMR               1016%   Conf = 0.05
kNN                 All Features       508%    k = 9
                    Correlation        635%    k = 5
                    mRMR               1143%   k = 8
to a class are calculated using only that class, and the most probable class is then chosen. In this case, oversampling is only used with the mRMR feature set, which allows the classifier to overcome some confusion near the decision frontier. The numerical values for the full feature set are handled with supervised discretization; for the other feature sets, the best way to deal with them is kernel density estimation of the probability density function.
SVM
The Gaussian SVM (SVM RBF) is considerably sensitive to class imbalance. For this reason, the grid search always chooses a synthetic oversampling percentage that nearly balances the classes. Remarkably, for the polynomial SVM, the grid search leads to an inversion of the class distribution. However, on average, considering only the training set, the results using a Gaussian kernel are better than those using a polynomial kernel.
For the polynomial SVM (SVM Poly), the SMOTE applied inverts the imbalance of the data in all three cases. The AD class becomes overrepresented, which shows that this model prefers an overrepresented AD class. As a side effect, this reduces the confusion at the class borders, since they are now overpopulated with AD instances. Overpopulating with AD instances increases the probability of correctly classifying the AD instances at the border; consequently, the MCI instances at this border suffer more misclassification. This increase in misclassification is more acceptable than misclassifying AD, since the dataset has more MCI instances than AD instances. The complexity found is relatively small, between 0.5 and 1.5. With more features (the original case) a polynomial degree of 1 suffices, but the smaller feature sets use a higher degree.
Neural Networks
For the Artificial Neural Networks (Neural Networks), it can be observed that SMOTE is always kept at 0%. This shows that artificial neural networks, in all feature sets tested, are not sensitive to the class imbalance. Nevertheless, the median |TPR − FPR| is generally worse in all feature sets. For the diagnosis case, and taking into account the mean |TPR − FPR|, the Neural Networks are the worst model tested.
Decision Tree
For the C4.5 decision tree, the SMOTE percentage selected using all features almost leads to the balanced state, but for correlation and mRMR the chosen oversampling inverts the balance of the classes. This shows that, for the reduced feature sets, this classifier prefers to have the AD class overrepresented. The confidence factor chosen is 0.05 in all feature sets; lowering the confidence factor increases the pruning of the tree, so the selected value shows that the model places little confidence in the dataset.
kNN
k-Nearest Neighbours (kNN) is also sensitive to class imbalance, so the selected SMOTE tends toward the balanced state, except in the mRMR case, where the best SMOTE inverted the data balance. Again, this can be explained by the need to define the classification frontiers by overpopulating them with instances of the least represented class. The number of neighbours chosen for the full set and the mRMR set is large, 9 and 8 respectively, which again indicates confusion in the classification. For the correlation-based set, the number of neighbours is 5, indicating a less confused dataset.
Statistically, we can compare the models that use the same feature set, using the training set. For this analysis we use paired t-tests, applied in an all-vs-all fashion. The t-tests are only applied if an ANOVA test with 95% confidence confirms the existence of a significant difference.
Using a paired all-vs-all approach, in each feature set:
• Original (All Features)
The SVM RBF shows, in all t-tests with 95% confidence, a significant difference, being the best model in all cases. The decision tree is in all cases the worst model, with 95% confidence.
• Correlation
For the correlation-based feature set, Naïve Bayes and the SVM RBF are the best models, with no statistically significant difference between them at 95% confidence. The Neural Networks are the worst model.
• mRMR
For the mRMR feature set, the SVM RBF has a significant difference to all other models and is in all cases the best one (at 95% confidence). Again, the Neural Networks are the worst model.
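The paired t statistic used in these comparisons can be sketched as follows; the ANOVA gate and the t-distribution lookup for the 95% threshold are left out:

```python
import math

def paired_t(a, b):
    """Paired t statistic for two matched samples of results (e.g. the
    same folds/seeds evaluated with two different models).
    Returns (t, degrees_of_freedom).

    The thesis only runs these tests when a one-way ANOVA at 95%
    confidence first signals a difference somewhere among the models."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1
```

The resulting t is then compared against the two-sided 95% critical value for n − 1 degrees of freedom; consistently positive differences across repetitions yield a large |t| even when the mean gap is small.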
Now, using the test set that was never seen before, it is possible to evaluate the models' behaviour in a "real world" environment. With these results we can compare models that use different feature sets, since the features were selected without the help of the test set. Figure 3.7 shows the test results; the scale of the plots is |TPR − FPR|.
Looking at the test results, we can see that 3 out of the 6 classification algorithms have, for one feature set, a result |TPR − FPR| > 0.6. These classifiers are: Naïve Bayes with correlation, Neural Networks with correlation, and kNN with all features. For the SVMs, the results using the different feature sets are all very similar (around |TPR − FPR| ≈ 0.5). For the C4.5 decision tree, the results vary significantly among feature sets; results with the original feature set are considerably worse than for the other models (|TPR − FPR| ≈ 0.2). For kNN, the best results are achieved using all features.
Analysing the results, we can see that the best feature set can change from classifier to classifier, which means that a single feature set is not always the best for all cases but varies from model to model. Table 3.7 shows, for the best classifiers, the most common metrics and the |TPR − FPR|.
Using the training set, the highest median value is |TPR − FPR| ≈ 0.6, obtained by the SVM RBF with correlation-based feature selection (the results with the other feature sets are very similar). However, on the test set this model only reaches |TPR − FPR| ≈ 0.5. The maximum value obtained over all algorithms is |TPR − FPR| ≈ 0.6; it appears in 2 models that use the correlation feature set, Naïve Bayes and the Neural Network. In this case, we can compare the result on the training set with the result on the test set (which simulates the real world with a fully independent sample) to analyse the consequence of using only the training set to pick the best model. The SVM RBF with correlation-based feature selection appears to have the best results; however, its generalization is not so good, since in contact with unknown instances its results drop.
Now we compare the other metrics, such as accuracy, sensitivity, specificity and area under the ROC curve, on the test set. Analysing the accuracy results, we can see two values above 90%, for Naïve Bayes and the Neural Networks. Note that those models have a high |TPR − FPR| score of about 0.62. But kNN, which also scores 0.62 in |TPR − FPR|, has an accuracy of only 78%; using accuracy alone, this model would be considered inferior and dismissed. The trade-off between sensitivity and specificity is what |TPR − FPR| captures. We can see that the Neural Network and Naïve Bayes have higher sensitivity, while kNN has the highest specificity but one of the lowest sensitivities. Using the AUC (ROC area), the three top scores coincide with the three top |TPR − FPR| scores; the AUC metric is also suitable for imbalanced data. It is not easy to choose the best model; however, the model with the highest |TPR − FPR| and area under the ROC is the Neural Network.
Table 3.8 shows the confusion matrix of the Neural Network model; we can see that the majority of the instances in each class are correctly classified.
Figure 3.7: Results for the diagnosis using an independent test set; the scale is |TPR − FPR|, where higher is better. Each panel shows one classifier: kNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF and SVM Poly.
Table 3.7: Best classifiers for the diagnosis problem, using the test set.

Classifier   FS           Accuracy  Sensitivity  Specificity  ROC Area  |TPR−FPR|
Naïve Bayes  Correlation  91%       93%          69%          0.85      0.62
SVM RBF      Correlation  76%       76%          76%          0.77      0.53
SVM Poly     Original     78%       79%          69%          0.74      0.48
NN           Correlation  92%       94%          69%          0.91      0.62
DT C4.5      Correlation  76%       76%          69%          0.76      0.45
kNN          Original     78%       78%          85%          0.86      0.62
3.4 Summary
Table 3.8: Confusion matrix for the diagnosis problem, using the test set. The algorithm used is the Artificial Neural Network with the correlation-based feature set.

           Predicted MCI  Predicted AD
Real MCI   134            9
Real AD    4              9

The missing values in the database have been taken into consideration. This problem was analysed under two assumptions: a random distribution and a non-random distribution. A set of experiments was performed using linear regression, expectation maximization, median/mode imputation, and a single value. The conclusion was that the missing values behave as if random, and that the results do not significantly improve by using those techniques. Other approaches could be taken to deal with the missing data, such as using unsupervised learning to impute missing values from clusters, but in this work those approaches were not tested.
To address the need of knowing the behaviour of the created models in a real environment, that is, their behaviour in contact with new instances, a completely independent test set was created from the raw dataset. This approach is common in data mining contests, where the training set is given to the competitors and the test set used to compare the submitted models is provided later. The metric used to compare the results is |TPR − FPR|, which, by taking into account random classification in the ROC space, gives us an unbiased metric. Using metrics like precision, F-measure, sensitivity, specificity or accuracy would give biased results as a consequence of the imbalanced data.
In this chapter, the diagnosis problem was addressed. For that, a methodology was defined to build models using the clinical data, taking into account the missing data and the class imbalance. In this problem we aim at differentiating MCI and AD in an imbalanced dataset, with high dimensionality and a high percentage of missing data. For this we defined and applied the described methodology. In the course of this work we found that, in contact with an independent test set, the models that show the highest generalization are Naïve Bayes, the Artificial Neural Network and kNN. Furthermore, a single feature subset is not always the best one; this allowed us to conclude that the best feature set depends on the algorithm used, and probably also on the parameters used. The best models found have |TPR − FPR| ≈ 0.6. This result shows that the diagnosis problem is indeed complex, but that we achieved a good discriminative model for the MCI and AD classes using state-of-the-art techniques. The best models are Naïve Bayes and the Neural Networks using correlation-based feature selection, and kNN using all features. Other metrics, in particular the area under the ROC, also indicate that the obtained models have a high discriminative power.
4. Predicting conversion from MCI to AD (Prognosis)
The prognosis of a patient is of great importance to medical doctors: it allows adequate medical care for the patient and support for the family. The prognosis of Alzheimer's disease (or another cognitive impairment) also plays a role in the patient's decisions about the future. For example, if the conversion to AD will occur within a year and the patient has a high-responsibility job, e.g. as a company manager or a pilot, the patient can adjust his life to minimize the impact of his disease on society.
4.1 Prognosis prediction approach
For prognosis prediction we use two different approaches. The first one, normally used in similar problems [8, 19, 25, 27], consists of determining whether a patient will ever convert to AD. This approach will be referred to in this work as First and Last Evaluation, since it looks at the first and last entry of the patient in the database to determine whether the patient evolved from MCI to AD. In this approach, each patient has only a single entry in the post-processed dataset.
The second approach looks at a given temporal window and tries to predict whether a patient converts from MCI (at the beginning of the temporal window) to AD (at the end). For this, and according to Figure 4.1, a new set of labels has been created: evolution (Evol) and no evolution (noEvol). The noEvol class is considered the positive class. Notice that any instance with insufficient knowledge about the outcome is removed in the process, since the behaviour of the disease inside the window is unknown.
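The labelling can be sketched as follows; the (day, diagnosis) encoding and the exact tie-breaking rules are illustrative assumptions, since only the graphical definition in Figure 4.1 is given:

```python
def window_label(evaluations, window_days):
    """Label one patient's first evaluation for a temporal window of
    `window_days`. `evaluations` is a time-ordered list of
    (day, diagnosis) pairs with diagnosis in {"MCI", "AD"}
    (a hypothetical encoding).

    - "Evol":   MCI at the start, and AD observed inside the window.
    - "noEvol": still MCI at or after the end of the window.
    - "UNK":    follow-up too short to know the outcome inside the
                window; such instances are removed from the dataset."""
    start_day, start_dx = evaluations[0]
    if start_dx != "MCI":
        return "UNK"
    end = start_day + window_days
    for day, dx in evaluations[1:]:
        if dx == "AD" and day <= end:
            return "Evol"
        if dx == "MCI" and day >= end:
            return "noEvol"
    return "UNK"
```

For instance, a patient seen as MCI at days 0 and 400 who is AD at day 800 is "Evol" for a 3-year window but "UNK" for a 2-year one, since the conversion may have happened after the 2-year window closed.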
To choose the temporal windows, two factors were considered: (i) the distribution of instances between classes (Evol/noEvol) and (ii) the medical relevance, obtained by consulting the medical partners of the NEUROCLINOMICS project. For the latter, a period of around 3 years was recommended. For the former, we extracted the class distribution as a function of the temporal window size (see Figure 4.2). By analysing the evolution of the labels as a function of window size, it can be observed that a temporal window of 3 years balances the classes Evol and noEvol. Thus, three temporal windows, one year apart, have been created: 2 years, 3 years and 4 years. For the first and last evaluation approach, after data pre-processing, the class distribution is 37% Evol and 63% noEvol for both the training and test sets, as presented in Table 4.1.
Figure 4.1: Graphical representation of the new class labels created for the temporal-window prognosis problem (MCI, DEM, Evol, noEvol, UNK-MCI-MCI and UNK-MCI-? over time).
Figure 4.2: Variation of the class labels (instances) with the size of the temporal window (days). These results were obtained using all data; only Evolution (Evol) and no Evolution (noEvol) are used.
4.2 Classification Model
The classification model used for the prognosis prediction is simpler than the one used for the diagnosis. Independent training and test sets have been created, and a grid search is applied to find the best model parameters on the training set. As in the diagnosis, the |TPR − FPR| metric, which considers the balance between specificity and sensitivity, was used to determine the best model. Three feature sets are again used: Original (i.e., all features), Correlation (obtained with correlation-based feature selection) and mRMR (obtained with the mRMR feature selection). Figure 4.3 shows the model used in the grid search to find the best parameters using the training set, and Figure 4.4 shows the model used for testing after the parameters have been found.
Figure 4.3: Data flow used in the parameter grid search for finding the classifier parameters. Feature selection and the generation of 10 non-overlapping folds for cross-validation precede model training and evaluation; the SMOTE percentage is tested with 11 different values for each parameter combination.
Figure 4.4: Data flow used to simulate the real-world results. The training set, after SMOTE synthetic oversampling, trains the classifier with the selected parameters; the results are then obtained on the test set.
To deal with the missing data problem, a study similar to the one made for the diagnosis was performed. The following methods are used to deal with missing values:

• In Naïve Bayes, the internal mechanism of ignoring the missing data is used, which simply excludes them from the calculations.

• In the SVMs, the internal median/mode imputation is used.

• In kNN, median/mode imputation is used, which gives better results than WEKA's default of considering the maximum distance between instances.

• In the C4.5 decision tree, median/mode imputation is used, since the tests showed that this method was better than the default statistical one.

• In the Neural Networks, median/mode imputation is used, since the tests showed that this method was better than turning off the input neuron when a value is missing.
4.3 Results and Discussion
4.3.1 First and Last Evaluation
Table 4.1: Demographics of the First and Last Evaluation dataset.

Groups                                        noEvol      Evol
Size                                          63%         37%
Age (Mean ± SD)                               68.6 ± 8.7  71.6 ± 8.1
Sex (Male/Female)                             89/124      41/82
Schooling years (Mean ± SD)                   8.7 ± 4.9   8.6 ± 5.0
Time between assessments (years) (Mean ± SD)  2.7 ± 2.3   2.5 ± 1.5
Table 4.2: First and Last Evaluation parameters, obtained using the training set. The number of features selected is 153 for All Features, 32 for Correlation and 22 for mRMR. The percentage of missing values is 43% for All Features, 18% for Correlation and 35% for mRMR.

Classifier          Feature Selection  SMOTE  Parameters
Naïve Bayes         All Features       0%     Supervised Discretization
                    Correlation        0%     Supervised Discretization
                    mRMR               150%   Gaussian
SVM RBF             All Features       50%    Compl = 4.5 and γ = 0.001
                    Correlation        0%     Compl = 1.5 and γ = 0.1
                    mRMR               50%    Compl = 4.0 and γ = 0.01
SVM Poly            All Features       500%   Compl = 2.0 and Exp = 1
                    Correlation        0%     Compl = 0.5 and Exp = 1
                    mRMR               250%   Compl = 1.0 and Exp = 1
Neural Network      All Features       0%     l = 0.3, m = 0.1 and time = 1000
                    Correlation        0%     l = 0.3, m = 0.2 and time = 1000
                    mRMR               0%     l = 0.3, m = 0.1 and time = 1000
Decision Tree C4.5  All Features       400%   Conf = 0.35
                    Correlation        400%   Conf = 0.5
                    mRMR               50%    Conf = 0.15
kNN                 All Features       350%   k = 4
                    Correlation        100%   k = 9
                    mRMR               350%   k = 10
Table 4.2 presents the parameters found by the grid search described in Section 4.2. The processed dataset has a minor imbalance, 63% noEvol vs 37% Evol (see Table 4.1). Nevertheless, the oversampling technique, SMOTE, was applied; the balanced state is obtained with approximately 70% oversampling of the minority class. For the Naïve Bayes classifier, oversampling was used only with the mRMR features; in this case, for example, oversampling inverts the class balance. Another, more extreme, example of this is observed when using the original set of features with the SVM Poly classifier: an oversampling of 500% was chosen, which turns the minority class into a majority class. With the decision trees and kNN we note that in some cases the oversampling used also completely inverts the class balance.
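The roughly 70% figure follows from simple arithmetic: to balance the classes, each minority instance needs (n_majority / n_minority − 1) synthetic copies. A small sketch:

```python
def balancing_smote_percentage(n_minority, n_majority):
    """SMOTE percentage that makes the minority class as large as the
    majority class: each minority instance gains
    (n_majority / n_minority - 1) synthetic copies."""
    return 100.0 * (n_majority - n_minority) / n_minority

# First-and-last dataset (63% noEvol vs 37% Evol):
#   balancing_smote_percentage(37, 63) -> ~70%
# Diagnosis train set (856 MCI vs 125 AD):
#   balancing_smote_percentage(125, 856) -> ~585%, consistent with the
#   roughly 600% mentioned for the diagnosis problem.
```

Percentages above this threshold, such as the 500% chosen for SVM Poly, therefore invert the class balance rather than merely correct it.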
Statistically, we can compare the models using 30 repetitions, obtained by running the models 30 times with 30 different seeds. For this analysis we used paired t-tests, applied in an all-vs-all fashion. The t-tests are only applied if an ANOVA test with a confidence level of 95% confirms the existence of a significant difference. Using paired t-tests, all vs all, with 95% confidence, in each feature set we observed:
• Original (All Features)
The Naïve Bayes and SVM RBF classifiers do not have a significant difference between them, but both differ significantly from all other models, always with a greater mean result. kNN has a statistically insignificant difference only to the Neural Network, having in all other cases a worse mean result.
• Correlation
Naïve Bayes has a significant difference to all other models, always with a greater mean result. The Neural Network model also has a significant difference, but with a worse mean result in all cases.
• mRMR
Naïve Bayes and the SVM RBF again have the best results, with a significant difference from all other models. The model with the worst result is kNN.
Table 4.3: Best classifiers for the First and Last problem, using the test set.

Classifier   FS           Accuracy  Sensitivity  Specificity  ROC Area  |TPR−FPR|
Naïve Bayes  Correlation  71%       72%          68%          0.67      0.40
SVM RBF      Original     67%       63%          77%          0.70      0.40
SVM Poly     Correlation  75%       84%          50%          0.67      0.34
NN           Original     66%       70%          55%          0.70      0.25
DT C4.5      mRMR         63%       70%          45%          0.59      0.16
kNN          Correlation  53%       46%          73%          0.59      0.18
Figure 4.5: Test results of the prognosis using First and Last Evaluations. Each panel shows |TPR − FPR| for one classifier: kNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF and SVM Poly.

Now looking at the test set results (Figure 4.5), we can observe that the overall results are disappointing, although they have a similar or slightly higher level of discriminative power compared to the ones
obtained by Maroco et al. [35, 37]. Note, however, that a true comparison cannot be made, since the dataset used in this work is slightly different. In this problem, we can see clearly that Naïve Bayes and the SVMs are the best algorithms for almost all feature sets. The best result is |TPR − FPR| ≈ 0.4, using Naïve Bayes with the original features (all features) and with the correlation-based features; with the SVM RBF the same result is obtained using the original feature set. We can see that trying to discriminate progression is not a simple task. Our results for this approach showed that in some cases we have some discriminative power, but we never obtained a model with more than approximately 0.4 in |TPR − FPR|: the models are closer to the random model than to the perfect one. Table 4.3 presents the results considering other metrics besides |TPR − FPR|. In all cases the area under the ROC is below 0.70 and the accuracy is never above 75%. These results agree with our previous conclusions using the |TPR − FPR| metric: the models obtained using this approach have little predictive power.
4.3.2 Temporal Windows
One of the problems for not achieving better results with the first and last approach may came
from the dataset itself. Since the follow up time differs from patient to patient, confusion may arise
from the fact that a patient has not yet evolved because insufficient time has passed. As we will
see, this problem is mitigated by the temporal windows approach leading to substantially better
results. In accordance with to the medical opinion, three temporal windows are considered: 2, 3
and 4 years.
Two Years Temporal Window
Using the two-years temporal window, we get a dataset with the characteristics described in Table 4.4. This dataset has a minor imbalance: 67% of the evaluations are classified as no evolution (noEvol) and 33% as evolution (Evol). It is interesting to observe that the evolution group is older than the no-evolution group.
Table 4.4: Dataset demographics after applying the pre-processing for the two years temporal window.

Groups (2 Years)              noEvol        Evol
Size                          181 (67%)     90 (33%)
Age (Mean ± SD)               68.8 ± 8.2    72.5 ± 8.2
Sex (Male/Female)             74/107        33/57
Schooling years (Mean ± SD)   8.8 ± 5.1     8.7 ± 4.8
Table 4.5: Classification model parameters for the prognosis in a temporal window of 2 years. The number of features selected is 153 for All Features, 29 for Correlation and 15 for mRMR. In this set the missing values are 44% for All Features, 11% for Correlation and 43% for mRMR.

Classifier           Feature Selection   SMOTE   Parameters
Naïve Bayes          All Features        216%    Supervised discretization
                     Correlation         108%    Gaussian
                     mRMR                486%    Gaussian
SVM RBF              All Features        54%     Compl = 5.0 and γ = 0.01
                     Correlation         54%     Compl = 2.0 and γ = 0.1
                     mRMR                162%    Compl = 0.5 and γ = 1.0
SVM Poly             All Features        432%    Compl = 1.0 and Exp = 1
                     Correlation         486%    Compl = 1.0 and Exp = 1
                     mRMR                162%    Compl = 0.5 and Exp = 1
Neural Network       All Features        0%      l = 0.2, m = 0.1 and time = 1000
                     Correlation         0%      l = 0.2, m = 0.1 and time = 2000
                     mRMR                0%      l = 0.2, m = 0.2 and time = 2000
Decision Tree C4.5   All Features        216%    Conf = 0.05
                     Correlation         378%    Conf = 0.15
                     mRMR                270%    Conf = 0.35
kNN                  All Features        54%     k = 10
                     Correlation         650%    k = 3
                     mRMR                216%    k = 8
For the 2-years dataset the oversampling needed to balance the data is approximately 100%.
By analysing Table 4.5 we can see that, in most cases, when oversampling is applied, the
percentage used inverts the balance of classes. In those cases the oversampling helps to define the
decision boundaries. The neural network does not use any oversampling; this is consistent with
the majority of the results, which show that this algorithm is not sensitive to oversampling effects.
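The SMOTE percentages in Table 4.5 (e.g., 216% for Naïve Bayes with all features) denote the amount of synthetic minority data generated. The following hand-rolled, one-dimensional sketch mimics the idea of SMOTE [10], interpolating between neighbouring minority instances; it is not the implementation used in this work.

```python
import random

def smote_like(minority, percent, seed=0):
    """Generate percent% synthetic samples (relative to the minority size)
    by interpolating between neighbouring minority values (1-D sketch)."""
    rng = random.Random(seed)
    pts = sorted(minority)
    n_new = int(len(pts) * percent / 100)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(pts) - 1)
        a, b = pts[i], pts[i + 1]                     # neighbouring pair
        synthetic.append(a + rng.random() * (b - a))  # point on the segment
    return synthetic

evol = [1.0, 1.2, 1.5, 2.0]          # minority (Evol) scores, invented
new = smote_like(evol, percent=216)  # 216% oversampling
print(len(new))                      # 8 synthetic samples (216% of 4)
```

With a large enough percentage, the minority class overtakes the majority, which is exactly the balance inversion observed in Table 4.5.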
4. Predicting conversion from MCI to AD (Prognosis)

Statistically, we can compare the models using 30 repetitions. For this analysis we use paired
t-tests in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The two best models are Naïve Bayes and SVM RBF. Both have a significant difference to
the other models, with a higher mean. The Decision Tree model got the worst results,
having in all cases a significantly worse mean, except against the kNN, where no
statistically significant difference was found.
• Correlation
The Naïve Bayes model shows a significant difference to all other models and in all cases
a higher mean result. The Decision Tree model shows the worst result, having in all
cases a significant difference with a lower mean result.
• mRMR
The SVM RBF has a significant difference in all cases, with a higher mean result. The
Decision Tree is the worst model; it only achieved a significant difference with a higher mean
result against the Neural Networks.
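The all-vs-all comparison over the 30 repetitions can be sketched as follows. Only the paired t statistic is computed here; in practice the ANOVA gate and the p-values come from a statistics package, and the per-repetition scores below are invented.

```python
import math
import statistics

def paired_t(results_a, results_b):
    """Paired t statistic for two models evaluated on the same splits.
    |t| is then compared with the critical value for n-1 degrees of
    freedom at the chosen confidence level (95% here)."""
    d = [a - b for a, b in zip(results_a, results_b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Invented per-repetition |TPR - FPR| scores on identical folds
nb = [0.42, 0.40, 0.45, 0.41, 0.44, 0.43]   # e.g. a strong model
dt = [0.15, 0.12, 0.20, 0.14, 0.18, 0.13]   # e.g. a weak model
print(paired_t(nb, dt) > 2.571)  # exceeds the df=5 critical value at 95%
```

Because the repetitions share the same train/test splits, the paired version of the test is the appropriate one: it compares the per-split differences rather than the two raw score distributions.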
[Figure: six panels (Naïve Bayes, SVM RBF, SVM Poly, Decision Tree C4.5, Neural Network, kNN), each plotting |TPR − FPR| on a 0 to 1 scale.]
Figure 4.6: Test results of Prognosis using two years temporal window
Analysing the results on the test set, presented in Figure 4.6, we can observe that three models
achieve good results in all datasets. Those models are Naïve Bayes and the SVMs (with a linear and
Table 4.6: Best classifiers for the prognosis problem with a 2 years temporal window, using the test set.

Classifier    FS            Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
Naïve Bayes   Correlation   75%        78%           64%           0.79       0.42
SVM RBF       Original      76%        80%           64%           0.72       0.44
SVM Poly      mRMR          73%        69%           86%           0.78       0.55
NN            Original      79%        84%           64%           0.84       0.48
DT C4.5       Correlation   60%        63%           50%           0.62       0.13
kNN           Correlation   60%        51%           93%           0.74       0.44
RBF kernel). The Neural Network, with the original dataset, has a result of almost |TPR − FPR| ≈
0.5. However, these results are not achieved with the reduced feature sets, which means that the
reduced feature sets are not particularly well fitted to the Neural Networks. Another
interesting result is the kNN using the correlated dataset, where we obtained |TPR − FPR| ≈
0.45; with the same algorithm but using the mRMR dataset, the result is almost equal to a
random classifier. In this example we can clearly see the influence of the feature selection.
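The effect of a correlation-based reduced set can be pictured with a simple filter that keeps only the features sufficiently correlated with the class. This is an illustrative stand-in (feature names and data are made up; the thesis uses WEKA's selection methods):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def correlation_filter(features, labels, threshold=0.3):
    """Keep the features whose absolute correlation with the class label
    exceeds the threshold (a toy stand-in for correlation-based selection)."""
    return [name for name, values in features.items()
            if abs(pearson(values, labels)) > threshold]

# Toy data: 'mmse' tracks the label, 'noise' does not (values invented)
labels = [0, 0, 0, 1, 1, 1]
features = {"mmse": [28, 27, 29, 20, 21, 19], "noise": [5, 9, 4, 8, 9, 4]}
print(correlation_filter(features, labels))  # ['mmse']
```

A filter like this explains why a feature set that suits one algorithm (kNN with the correlated set) can fail another: the selection criterion is independent of the classifier that consumes the features.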
Analysing these results, we conclude that the three models with the highest generalization are:
the SVM polynomial using the mRMR features, the Neural Networks using the Original dataset, and the kNN
using the correlated dataset. Each of these models uses a different feature set, which allows us
to conclude that the features influence the algorithms differently. The best model is the SVM
polynomial (linear) using the mRMR feature selection. Based on the other metrics (see Table
4.6), it can be observed that the Neural Networks have the best ROC area, the best accuracy and
the highest sensitivity; however, their discrimination of the evolution class is lower.
Table 4.7: Dataset demographics after applying the pre-processing for the three years temporal window.

Groups (3 Years)              noEvol        Evol
Size                          122 (50%)     120 (50%)
Age (Mean ± SD)               68.7 ± 7.9    72.5 ± 8.3
Sex (Male/Female)             50/72         39/81
Schooling years (Mean ± SD)   9.0 ± 5.2     8.6 ± 4.8
Three Years Temporal Window
In the three years dataset the data balance is not a problem, since the two classes have the
same representation. As we can see, the oversampling is in most cases null or very
small. As we saw before, oversampling has the capability of overpopulating the dataset
to better define the decision frontier; this side effect is the main reason why oversampling is
sometimes used. Nevertheless, there are cases where the oversampling percentage is higher,
such as the Naïve Bayes. We can see that in this problem the Naïve Bayes got
Table 4.8: Classification model parameters for the prognosis in a temporal window of 3 years. The number of features selected is 153 for All Features, 21 for Correlation and 17 for mRMR. In this set the missing values are 43% for All Features, 12% for Correlation and 42% for mRMR.

Classifier           Feature Selection   SMOTE   Parameters
Naïve Bayes          All Features        266%    Supervised discretization
                     Correlation         266%    Supervised discretization
                     mRMR                0%      Gaussian
SVM RBF              All Features        0%      Compl = 1.0 and γ = 0.1
                     Correlation         76%     Compl = 1.5 and γ = 0.1
                     mRMR                0%      Compl = 0.5 and γ = 1.0
SVM Poly             All Features        38%     Compl = 1.5 and Exp = 1
                     Correlation         0%      Compl = 4.0 and Exp = 2
                     mRMR                0%      Compl = 1.5 and Exp = 2
Neural Network       All Features        0%      l = 0.3, m = 0.1 and time = 1000
                     Correlation         0%      l = 0.2, m = 0.1 and time = 2000
                     mRMR                0%      l = 0.2, m = 0.2 and time = 2000
Decision Tree C4.5   All Features        38%     Conf = 0.45
                     Correlation         152%    Conf = 0.45
                     mRMR                228%    Conf = 0.5
kNN                  All Features        0%      k = 5
                     Correlation         0%      k = 5
                     mRMR                0%      k = 9
some confusion on the decision borders, which is minimized by applying oversampling to the
noEvolution class. The decision trees have the same problem in all feature sets.
Statistically, we can compare the models using 30 repetitions. For this analysis we use paired
t-tests in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The Naïve Bayes and SVM RBF do not have a statistically significant difference between them, but
both differ significantly from all other models, always with a higher mean result.
The Decision Tree has in all cases a significantly lower mean result, except against the
Neural Network, where no significant difference was found.
• Correlation
The Naïve Bayes has a significant difference from all other models, with a greater mean
result. The Decision Tree has in all cases a significantly worse mean result.
• mRMR
The SVM RBF has a significant difference from all other models, always with a higher
mean result. The Decision Tree has in all cases a significant difference and a worse mean
result.
[Figure: six panels (Naïve Bayes, SVM RBF, SVM Poly, Decision Tree C4.5, Neural Network, kNN), each plotting |TPR − FPR| on a 0 to 1 scale.]
Figure 4.7: Test results of Prognosis using three years temporal window
Table 4.9: Best classifiers for the prognosis problem with a 3 years temporal window, using the test set.

Classifier    FS            Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
Naïve Bayes   mRMR          84%        94%           70%           0.86       0.64
SVM RBF       mRMR          82%        79%           87%           0.83       0.66
SVM Poly      Original      73%        76%           70%           0.73       0.45
NN            Original      71%        73%           70%           0.79       0.42
DT C4.5       Original      73%        82%           61%           0.72       0.43
kNN           Correlation   77%        76%           78%           0.78       0.54
In this case the dataset is not unbalanced, but oversampling still helped to delimit the decision
boundaries. In the three years time window, the test results show that all classifiers have a reasonable
behaviour. The best result obtained in this time window is |TPR − FPR| ≈ 0.65, a result that is
closer to the perfect classifier than to the random one. We obtained this result with two models:
the Naïve Bayes and the SVM RBF, both using the mRMR dataset. The Neural Network also has
good results in all datasets, with |TPR − FPR| ≈ 0.5 in all cases, although the best results are
achieved using the original set (all features). The kNN also has a good overall behaviour, reaching
|TPR − FPR| ≈ 0.55 with the correlated dataset. The worst results are achieved by
the decision trees; this can be explained by the high confidence obtained on the training set,
which caused the model to overfit.
It should be noticed that the best results were obtained using no oversampling and the
mRMR feature set. As suspected, oversampling has little influence when a balanced dataset
is used; nevertheless, in some cases it is beneficial, as is the case of the Naïve Bayes. The
best model is the radial SVM (SVM RBF) using the mRMR feature selection.

Using different metrics (see Table 4.9), we can observe that the Naïve Bayes achieves the highest
accuracy, sensitivity and ROC area. However, its specificity is significantly lower than that
of the radial SVM.
Table 4.10: Dataset demographics after applying the pre-processing for the four years temporal window.

Groups (4 Years)              noEvol        Evol
Size                          74 (35%)      140 (65%)
Age (Mean ± SD)               68.4 ± 8.1    71.8 ± 8.3
Sex (Male/Female)             33/41         46/94
Schooling years (Mean ± SD)   9.0 ± 5.1     8.7 ± 4.8
Table 4.11: Classification model parameters for the prognosis in a temporal window of 4 years. The number of features selected is 153 for All Features, 17 for Correlation and 19 for mRMR. In this set the missing values are 44% for All Features, 14% for Correlation and 43% for mRMR.

Classifier           Feature Selection   SMOTE   Parameters
Naïve Bayes          All Features        203%    Supervised discretization
                     Correlation         290%    Supervised discretization
                     mRMR                29%     Gaussian
SVM RBF              All Features        203%    Compl = 3.0 and γ = 0.01
                     Correlation         58%     Compl = 3.5 and γ = 0.1
                     mRMR                87%     Compl = 1.5 and γ = 0.1
SVM Poly             All Features        290%    Compl = 1.0 and Exp = 1
                     Correlation         87%     Compl = 1.5 and Exp = 1
                     mRMR                29%     Compl = 1.0 and Exp = 1
Neural Network       All Features        0%      l = 0.2, m = 0.2 and time = 1000
                     Correlation         0%      l = 0.3, m = 0.1 and time = 1000
                     mRMR                0%      l = 0.3, m = 0.1 and time = 1000
Decision Tree C4.5   All Features        87%     Conf = 0.05
                     Correlation         145%    Conf = 0.35
                     mRMR                174%    Conf = 0.2
kNN                  All Features        0%      k = 8
                     Correlation         0%      k = 5
                     mRMR                29%     k = 10
Four Years Temporal Window
In the four years time window the balance of the data is inverted with respect to the two years time
window: the class distribution is now 65% Evolution and 35% noEvolution. Table 4.11
shows the parameters after the grid search. It can be observed that oversampling is used in
almost all models, except in the Neural Networks; in the kNN, oversampling was only used with
the mRMR feature set, and its value in that case is very low. Note that the SMOTE level used
in the Naïve Bayes and the SVMs inverts the data balance; in this case the SMOTE
helps the definition of the borders, as in the previously studied cases.
Statistically, we can compare the models using 30 repetitions. As before, we use paired
t-tests in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The Naïve Bayes and SVM RBF do not have a significant difference between them, but both
differ significantly from all other models, always with a higher mean result. The
Decision Tree has in all cases a significantly worse mean result.
• Correlation
The Naïve Bayes, SVM RBF and kNN do not have a significant difference between them
and have the better mean results. The Decision Tree has in all cases a significantly worse mean
result.
• mRMR
The Naïve Bayes and SVM RBF do not have a significant difference between themselves,
but both differ significantly from all other models, with a higher mean result. The
Decision Tree and the Neural Network have no significant difference between them, but in all
other cases they show a significant difference with a worse mean result.
Table 4.12: Best classifiers for the prognosis problem with a 4 years temporal window, using the test set.

Classifier    FS            Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
Naïve Bayes   mRMR          74%        89%           64%           0.84       0.52
SVM RBF       mRMR          81%        83%           80%           0.82       0.63
SVM Poly      Original      77%        78%           76%           0.77       0.54
NN            mRMR          77%        72%           80%           0.88       0.52
DT C4.5       Original      74%        61%           84%           0.71       0.45
kNN           Correlation   79%        72%           84%           0.76       0.56
Using the four years temporal window we also obtained an overall good performance on the test
set. The best model found is the SVM RBF using the mRMR feature set; for this problem, the
mRMR set is the best in 3 of the 6 algorithms used. The best results are obtained using the SVM
RBF with the mRMR feature selection and the kNN with the correlated dataset. These results
correspond to a classifier closer to the perfect classifier than to the random one. Again we found that the
[Figure: six panels (Naïve Bayes, SVM RBF, SVM Poly, Decision Tree C4.5, Neural Network, kNN), each plotting |TPR − FPR| on a 0 to 1 scale.]
Figure 4.8: Test results of Prognosis using four years temporal window
classification performance depends on the features and on the classification algorithm. The worst
result is obtained when we use the decision trees with the mRMR feature set, which shows that,
with this feature set, the Decision Tree does not have a good generalization power. The best model in
this approach is the Gaussian SVM (SVM RBF) using the mRMR feature selection. By analysing
the other metrics (see Table 4.12) we conclude that this SVM has a well balanced result, with 83%
sensitivity, 80% specificity and a 0.82 ROC area.
4.4 Summary
In this chapter, we studied two approaches to process the data in order to predict conversion
from MCI to AD (prognosis), using state of the art methods. The standard method, named in this
work First and Last Evaluation, was tested to define a baseline in discriminative
power. The temporal windows approach was presented as an alternative to this method and as a
way to take into account the different profiles that a patient can have in their evaluation
history. With the temporal windows approach, we obtained a higher discriminative power on the test
set (pre-processed for each problem). The results show that using temporal windows increases
the prediction capability of the models. In this work, we also compared our First
and Last Evaluation approach with the work of Maroco et al [36], achieving slightly better results;
with the temporal windows, the results have an even greater discriminative power.

Regarding the size of the temporal windows, we obtained the best results on the test set using the three years
temporal window, as we expected from the medical feedback.
Table 4.13: Best models for each progression approach.

Approach       Classifier    Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
First & Last   Naïve Bayes   71%        72%           68%           0.67       0.4
2 Years        SVM Poly      73%        69%           86%           0.78       0.55
3 Years        SVM RBF       82%        79%           87%           0.83       0.66
4 Years        SVM RBF       81%        83%           80%           0.82       0.63
5 Decision Support System
The models created in chapters 3 and 4 showed an overall good performance in the tasks of
discriminating MCI patients (diagnosis) and predicting the progression from MCI to AD (prognosis).
However, these models are not directly usable by the medical doctors. To bridge this usability gap,
a solution was designed and implemented in this thesis to facilitate the use of the system by
third parties. By integrating the models into an information system, the medical doctors can now
evaluate them in a real work situation. Since this work was done in the context of the NEURO-
CLINOMICS project, the integration of the models into an application that can be used by the
healthcare professionals is of huge importance. With this in mind, a modular Decision Support
System (DSS) web services architecture was developed, which integrates with other tools devel-
oped in the project, in particular with the AD information system that is under development. The
use of web services allows model updating without altering any other part of the system.

The DSS was designed in a modular way and is composed of the following components: (i)
a data input system, where healthcare professionals introduce data relative to the instances or
update previous ones; (ii) a prediction system that computes the patient diagnosis and prognosis (2, 3 and 4 years)
based on previously trained models; (iii) an automatic training tool that updates the model parameters
(feature set selection, oversampling percentage and the algorithm parameters) based on the
complete known instances.
5.1 System
Using WEKA, six models have been created for each problem: diagnosis, and prognosis in
a temporal window of two, three and four years. These models are parametrized using the tool
created and explained previously. The user has the possibility of choosing any of the models,
or even all of them. For example, the user can choose, from the 6 models of a specific problem, those
with more confidence in their results, and present a confidence interval to the final user.
The user can also test a patient evaluation: the system returns the diagnosis prediction and,
if this diagnosis is MCI, the system then shows the prognosis results. In this way the system
integrates the diagnosis and prognosis in a simple way. The constant update of the database will
Figure 5.1: DSS web service architecture
decrease the missing data, since in the new evaluations the clinicians are now using a higher number
of assessment tests (features). This change will likely lead the system to choose new feature sets,
since, for example, features with fewer missing values are likely more relevant to the classification.
Database updates can also change the class balance. The models must therefore be updated and
re-parametrized taking these factors into account. For this, the parametrization system described
in chapter 3 is used to acquire the new parameters and then create new models that reflect the
changes.
The implementation of this DSS was performed using web services. This technology allows
us to deploy the services on an application server in the network; the services can then be remotely
accessed using a defined message scheme, e.g., XML. The DSS also allows easy integration
of new services and easy updating of existing ones.
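The concrete message scheme is not fixed here. Purely as an illustration, a request and a response could be serialized as XML like in the sketch below; the element, attribute and model names are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_request(model, evaluation):
    """Serialize a classification request for one of the web services.
    'evaluation' maps neuropsychological test names to scores; missing
    tests are simply omitted, mirroring the missing values in the data."""
    root = ET.Element("classificationRequest", model=model)
    for test, score in evaluation.items():
        ET.SubElement(root, "feature", name=test).text = str(score)
    return ET.tostring(root, encoding="unicode")

def parse_response(xml_text):
    """Extract the prediction confidence returned by the service."""
    return float(ET.fromstring(xml_text).findtext("confidence"))

req = build_request("NaiveBayes", {"mmse": 27, "adas_cog": 12})
print(req)
print(parse_response("<classificationResponse>"
                     "<confidence>0.92</confidence>"
                     "</classificationResponse>"))
```

A schema-based message like this is what makes the services replaceable: a retrained model can be deployed behind the same endpoint without changing any client.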
For this work four web services have been created: one for the diagnosis and one for each
prognosis temporal window. A web service receives as arguments the patient evaluation and
the model to be used, and returns the confidence of the prediction. The possible models to be
Figure 5.2: Prototype data input screen.
used are: Naïve Bayes, SVM (Gaussian or polynomial), kNN, C4.5 decision trees and artificial
neural networks. Figure 5.1 presents the architecture of the proposed system. The client
sends a request with the patient evaluation over the network to the web service, which interrogates
the classification models; in the end, a response is returned to the client.
Figure 5.3: Prototype output screen.
Using these web services we can construct applications that allow a client to easily access
the information created by the models. Figure 5.2 shows a prototype data-insertion screen.
In this prototype there are two boxes: one to select the classification model (Neural Networks,
SVMs, Naïve Bayes, kNN and C4.5 Decision Tree) and another, called "Patient Evaluation", where
the patient evaluation is inserted. After clicking "Submit", the classification request is sent to
the web service. When the response to the request arrives, the screen shown in Figure 5.3
appears.
A system that uses the created web services already exists; it was created by a NEURO-
CLINOMICS team member in the scope of an AD information system. In Figure 5.4 a screenshot
of the system is shown. In the table of Figure 5.4, the predictions of all models are displayed;
for the diagnosis model the values correspond to the probability of the patient being MCI, and for
the prognosis, to the probability of the patient not evolving to AD. The graphic shows the maximum,
average and minimum for each problem (diagnosis and the three prognosis temporal windows); these
confidence intervals are created by applying multiple models to the same instance.

In Figure 5.4 a real instance is used: the user inquires the system about an instance
of patient 9, evaluated on 22/11/2004. The response indicates that the patient is MCI with a
model-average probability of 95%, and that the probability of not converting to AD within 2, 3 or
4 years is high (a model average of 90%, 92% and 90%, respectively). This means that the probability
of evolution to AD within 4 years is very low.
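The maximum/average/minimum aggregation behind the graphic can be sketched as follows; the per-model probabilities are invented for illustration.

```python
def confidence_interval(predictions):
    """Aggregate per-model probabilities for one instance into the
    minimum / average / maximum shown in the DSS graphic."""
    probs = list(predictions.values())
    return min(probs), sum(probs) / len(probs), max(probs)

# Hypothetical per-model probabilities of 'no evolution' for one window
three_year = {"NaiveBayes": 0.95, "SVM_RBF": 0.93, "SVM_Poly": 0.90,
              "NeuralNetwork": 0.92, "C4.5": 0.88, "kNN": 0.94}
lo, avg, hi = confidence_interval(three_year)
print(lo, round(avg, 2), hi)  # 0.88 0.92 0.95
```

A narrow min-max band means the six models agree, so the clinician can place more trust in the average shown.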
Figure 5.4: DSS system screen. The values for diagnosis represent the probability of the patient being MCI; for prognosis, the values represent the probability of not evolving to AD.
5.2 Summary
In this chapter a decision support system, built on a web services architecture, was briefly
described. The DSS was created to facilitate the use of the system by the medical doctors and
to integrate the work done in this thesis with the NEUROCLINOMICS project. A
prototype was described and a functional application was shown. Since the created system is highly
modular, the classification models are easy to upgrade, and the upgrade can be done automatically.
6 Conclusions and Future Work
In this work we studied the diagnosis and prognosis prediction for patients suffering from Alzheimer's
disease. To perform this work we used a dataset consisting of numeric neuropsychological tests and
the corresponding diagnosis. Because new neuropsychological tests were added to the dataset
over time, it suffers from missing values. Furthermore, since there are many more MCI in-
stances in the dataset than AD instances, it also suffers from class imbalance. Thus, this work
deals with the combined influence of class imbalance, high dimensionality and missing values. To evaluate
the influence of all these factors, a unified approach was designed to create and evaluate the
models. This approach uses a grid search that combines oversampling with a multidimensional
parameter search.
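Conceptually, this unified search enumerates the Cartesian product of an oversampling axis and the algorithm's hyper-parameter axes, scoring each candidate by cross-validated |TPR − FPR|. The sketch below only shows the enumeration; the search-space values are illustrative, not the ones used in the experiments.

```python
from itertools import product

def grid(axes):
    """Enumerate every combination of SMOTE percentage and algorithm
    hyper-parameters; each combination would then be scored by
    cross-validated |TPR - FPR|."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*axes.values())]

# Hypothetical search space for an RBF SVM
space = {"smote": [0, 54, 108, 162, 216],
         "C": [0.5, 1.0, 2.0, 5.0],
         "gamma": [0.01, 0.1, 1.0]}
configs = grid(space)
print(len(configs))  # 5 * 4 * 3 = 60 candidate configurations
```

Treating the oversampling percentage as just another axis of the grid is what makes the parametrization uniform across the six algorithms.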
To evaluate the results without bias in the case of an unbalanced dataset, the |TPR − FPR|
metric is used for all models in this work. This metric is a trade-off between sensitivity and
specificity.
In this work we tackled the diagnosis problem and created models that discriminate MCI and
AD cases. We analysed the behaviour of a set of supervised data mining algorithms and con-
cluded that Naïve Bayes and Neural Networks have a better performance when confronted with
an unknown test set. Those results are obtained using the Original set of features (All Features)
and the Correlated set. We can also conclude that 10-fold cross-validation provides an
estimate of the goodness of the result that does not match the one obtained on the test set. This can be
caused by overfitting to the training set, which leads to a low generalization of the models; simi-
lar results are obtained in all the analysed problems. One of the best diagnostic models was obtained
using the Naïve Bayes algorithm, with an accuracy of 91%, a sensitivity of 93%, a specificity of 69%,
a ROC Area of 0.85 and a |TPR − FPR| of 0.62.
For the prognosis problem we presented a new approach to predicting the conversion from MCI
to AD. The standard method is to use the first and last evaluation of the patient. In our opinion,
this approach discards important information, such as the profiles that a patient can have in their
evaluation history: over those 10 years of follow-up, a patient can pass through profiles that other
patients can also have. By using a temporal window approach, we obtained better discriminative results. We
concluded that the best models are the Naïve Bayes and SVM algorithms, and that the
mRMR feature set showed very good results, generally better than those obtained with the original
set or the correlated set. The temporal window with the highest discriminative power is the 3
years window. Using this window, the best model was obtained with the radial SVM algorithm, which
has an accuracy of 82%, a sensitivity of 79%, a specificity of 86%, a ROC Area of 0.83 and a
|TPR − FPR| of 0.64.
Finally, we created a Decision Support System that uses the diagnosis and prognosis models.
This system can help the medical doctors to evaluate patients in a short space of time. The
system was implemented using web services in order to integrate this work into the NEUROCLINOMICS
project. The use of web services allows this work to be integrated with other works in the scope of
the project, since it uses a simple communication protocol.
Future work
To improve the quality of the decision support system, including the prediction models, new
approaches and techniques should be investigated. This includes the use of state-of-the-art super-
vised or unsupervised data discretization and feature selection techniques, as well as new classification
models, including those using boosting approaches. Further-
more, in order to identify patient profiles, unsupervised clustering techniques can be applied. This
would allow the development of specialized models for specific groups of patients and therefore
improve the prediction accuracy. It should be noted that, while some of these techniques were
applied in the course of this thesis, no significant results were achieved. Nonetheless, we believe
that a careful application of these techniques should be able to identify groups of patients.

Finally, in order to improve the decision support system information, techniques should be
studied to tackle the time-to-conversion problem, where one predicts how much time will pass
until the patient converts from MCI to AD.
Bibliography
[1] Abdi, H. and Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary
Reviews: Computational Statistics.
[2] Bekris, L. M., Yu, C.-E., Bird, T. D., and Tsuang, D. W. (2010). Review article: Genetics of
alzheimer disease. Journal of Geriatric Psychiatry and Neurology.
[3] Boonchuay, K., Sinapiromsaran, K., and Lursinsap, C. (2011). Minority split and gain ratio for
a class imbalance.
[4] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
[5] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
regression trees.
[6] Garcia, C. (1984). Doença de Alzheimer, problemas do diagnóstico clínico. PhD thesis, Faculdade
de Medicina de Lisboa.
[7] Chapman, R., Mapstone, M., McCrary, J., Gardner, M., Porsteinsson, A., Sandoval, T., Guillily,
M., DeGrush, E., and Reilly, L. (2011a). Predicting conversion from mild cognitive impairment
to alzheimer’s disease using neuropsychological tests and multivariate methods. Journal of
Clinical and Experimental Neuropsychology, 33(2):187–199.
[8] Chapman, R. M., Mapstone, M., McCrary, J. W., Gardner, M. N., Porsteinsson, A., Sandoval,
T. C., Guillily, M. D., DeGrush, E., and Reilly, L. A. (2011b). Predicting conversion from mild
cognitive impairment to alzheimer’s disease using neuropsychological tests and multivariate
methods. Journal of Clinical and Experimental Neuropsychology.
[9] Chawla, N. V. (2010). Data Mining for Imbalanced Datasets: An Overview, pages 875–886.
Number 40. Springer US.
[10] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic
Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16.
[D’Agostino et al.] D’Agostino, R. B., Lee, M.-L., Belanger, A. J., Cupples, L. A., Anderson, K.,
and Kannel, W. B. Relation of pooled logistic regression to time dependent cox regression
analysis: The framingham heart study.
[12] Data, M. (2007). Missing data in clinical. Group, pages 453–460.
[13] de Lemos, L. J. M., Silva, D., Guerreiro, M., Mendonca, A., Tomas, P., and Madeira, S.
(2012a). Discriminating alzheimer's disease from mild cognitive impairment using neuropsy-
chological data.
[14] de Lemos, L. J. M., Silva, D., Guerreiro, M., Mendonca, A., Tomas, P., and Madeira, S.
(2012b). Predicting conversion from mild cognitive impairment to alzheimer’s disease using
neuropsychological data: Preliminary results.
[15] Dietterich, T. G. and Bakiri, G. (1995). Solving multiclass learning problems via error-
correcting output codes. Journal of Artificial Intelligence Research 2.
[16] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass svm method? an empirical
study. Springer-Verlag Berlin Heidelberg.
[17] Elkan, C. (2001). The foundations of cost-sensitive learning. International Joint Conference
on Artificial Intelligence, 17(1):973–978.
[18] Ewers, M., Walsh, C., Trojanowski, J., Shaw, L., Petersen, R., Jack Jr, C., Feldman, H.,
Bokde, A., Alexander, G., Scheltens, P., et al. (2010a). Prediction of conversion from mild
cognitive impairment to alzheimer’s disease dementia based upon biomarkers and neuropsy-
chological test performance. Neurobiology of Aging.
[19] Ewers, M., Walsh, C., Trojanowski, J. Q., Shaw, L. M., Petersen, R. C., Jr., C. R. J., Feldman,
H. H., Bokde, A. L., Alexander, G. E., Scheltens, P., Vellas, B., Duboisl, B., Weiner, M., and
Hampel, H. (2010b). Prediction of conversion from mild cognitive impairment to alzheimer’s
disease dementia based upon biomarkers and neuropsychological test performance. Elsevier.
[20] Guerreiro, M., Silva, A. P., Botelho, M. A., Leitão, O., Castro-Caldas, A., and Garcia, C. (1994). Adaptação
a populacao portuguesa da traducao do mini mental state examination (mmse). Revista
Portuguesa de Neurologia.
[21] Hall, M. (1999). Correlation-based feature selection for machine learning. PhD thesis, The
University of Waikato.
[22] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The
WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18.
[23] Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition.
[24] He, H. and Garcia, E. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.
Bibliography
[25] Hinrichs, C., Singh, V., Xu, G., and Johnson, S. (2011). Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population. NeuroImage, 55(2):574–589.
[26] Hinrichs, C., Singh, V., Xu, G., and Johnson, S. C. (2010). Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population. NeuroImage.
[27] Jack Jr, C., Wiste, H., Vemuri, P., Weigand, S., Senjem, M., Zeng, G., Bernstein, M., Gunter,
J., Pankratz, V., Aisen, P., et al. (2010). Brain beta-amyloid measures and magnetic resonance
imaging atrophy both predict time-to-progression from mild cognitive impairment to alzheimer’s
disease. Brain, 133(11):3336–3348.
[28] John, G. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers.
In Proceedings of the eleventh conference on uncertainty in artificial intelligence, pages 338–
345. Morgan Kaufmann Publishers Inc.
[29] Kass, G. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2):119–127.
[30] Keerthi, S., Shevade, S., Bhattacharyya, C., and Murthy, K. (2001). Improvements to Platt’s
SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649.
[31] Kolibas, E., Korinkova, V., Novotny, V., Vajdickova, K., and Hunakova, D. (2000). ADAS-Cog (Alzheimer's Disease Assessment Scale-Cognitive subscale): validation of the Slovak version. Bratislavske Lekarske Listy.
[32] Liu, H. and Setiono, R. (1996). A probabilistic approach to feature selection: a filter solution. In Proceedings of the 13th International Conference on Machine Learning, pages 319–327. Morgan Kaufmann.
[33] Loewenstein, D., Greig, M., Schinka, J., Barker, W., Shen, Q., Potter, E., Raj, A., Brooks, L., Varon, D., Schoenberg, M., et al. (2012). An investigation of preMCI: Subtypes and longitudinal outcomes. Alzheimer's & Dementia, 8(3):172–179.
[34] Loh, W. and Shih, Y. (1997). Split selection methods for classification trees. Statistica Sinica, 7:815–840.
[35] Maroco, J., Silva, D., Guerreiro, M., de Mendonca, A., and Santana, I. (2011a). Prediction of dementia patients: A comparative approach using parametric vs. non-parametric classifiers. XIX Congresso Anual da Sociedade Portuguesa de Estatística.
[36] Maroco, J., Silva, D., Rodrigues, A., Guerreiro, M., Santana, I., and de Mendonca, A.
(2011b). Data mining methods in the prediction of dementia: A real-data comparison of the
accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural
networks, support vector machines, classification trees and random forests. BMC research
notes, 4(1):299.
[37] Maroco, J., Silva, D., Rodrigues, A., Guerreiro, M., Santana, I., and de Mendonca, A.
(2011c). Data mining methods in the prediction of dementia: A real-data comparison of the
accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural
networks, support vector machines, classification trees and random forests. BMC research
notes, 4(1):299.
[38] Folstein, M. F., Folstein, S. E., and McHugh, P. R. (1975). "Mini-mental state": A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3):189–198.
[39] Mika, S., Ratsch, G., Weston, J., Scholkopf, B., and Mullers, K. (1999). Fisher discriminant
analysis with kernels. In Proceedings of the 1999 IEEE Signal Processing Society Workshop
on Neural Networks for Signal Processing, pages 41–48. IEEE.
[40] Mufson, E., Binder, L., Counts, S., DeKosky, S., deToledo Morrell, L., Ginsberg, S.,
Ikonomovic, M., Perez, S., and Scheff, S. (2012). Mild cognitive impairment: pathology and
mechanisms. Acta Neuropathologica, 123(1):13–30.
[41] Neter, J., Kutner, M. H., Nachtsheim, C., and Wasserman, W. (1996). Applied Linear Regression Models. The McGraw-Hill Companies.
[42] Noorbakhsh, F., Overall, C. M., and Power, C. (2009). Deciphering complex mechanisms
in neurodegenerative diseases: the advent of systems biology. Trends in Neurosciences,
32(2):88–100.
[43] Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238.
[44] Platt, J. C., Cristianini, N., and Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12.
[45] Powers, D. M. W. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1):37–63.
[46] Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[47] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.
[48] Raileanu, L. E. and Stoffel, K. (2004). Theoretical comparison between the Gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):77–93.
[49] Robert, P., Ferris, S., Gauthier, S., Ihl, R., Winblad, B., and Tennigkeit, F. (2010). Review of
alzheimer’s disease scales: is there a need for a new multi-domain scale for therapy evaluation
in medical practice? Alzheimer’s Research & Therapy.
[50] Samtani, M., Farnum, M., Lobanov, V., Yang, E., Raghavan, N., DiBernardo, A., Narayan,
V., et al. (2011). An improved model for disease progression in patients from the alzheimer’s
disease neuroimaging initiative. The Journal of Clinical Pharmacology.
[51] Silva, D., Guerreiro, M., Maroco, J., Santana, I., Rodrigues, A., Bravo Marques, J., and de Mendonca, A. (2012). Comparison of four verbal memory tests for the diagnosis and predictive value of mild cognitive impairment. Dementia and Geriatric Cognitive Disorders Extra, 2(1):120–131.
[52] Silva, D., Santana, I., do Couto, F. S., Maroco, J., Guerreiro, M., and de Mendonca, A. (2008). Cognitive deficits in middle-aged and older adults with bipolar disorder and cognitive complaints: Comparison with mild cognitive impairment. International Journal of Geriatric Psychiatry.
Appendix A: Medical exams (in Portuguese)
Table A.1: Feature list, part 1
Feature | Type | Description
Case number for this database | Numeric | Rank of cases
Age | Numeric | Age at evaluation
DiagNPS | String | Diagnosis from psychologist
Diagnosis code | Numeric | Neuropsychological and clinical diagnosis
Disease duration | Numeric | Evolution of cognitive symptoms in years
Date | Date | Date of the evaluation
School | Numeric | Years of formal education
Group | Numeric | Group in BLAD controls
Gender | Numeric |
Birth | Date | Date of birth
As cut | Numeric | Cancellation task (Corte de As), cuts (Min=0; Max (best score)=16)
As time | Numeric | Cancellation task (Corte de As), time (in seconds; lower is better)
As tot | Numeric | Cancellation task (Corte de As), total (cuts / time * 10)
DS forw | Numeric | Digit Span forward (Min=0; Max (best score)=9)
DS back | Numeric | Digit Span backward (Min=0; Max (best score)=8)
DS tot | Numeric | Digit Span total (forward + backward)
PA Easy | Numeric | Verbal Paired-Associate Learning, easy (Min=0; Max (best score)=18)
PA Dif | Numeric | Verbal Paired-Associate Learning, difficult (Min=0; Max (best score)=12)
PA Tot | Numeric | Verbal Paired-Associate Learning, total [(easy/2) + difficult]
PA Inter Easy | Numeric | Verbal Paired-Associate Learning (with interference), easy
PA Inter Dif | Numeric | Verbal Paired-Associate Learning (with interference), difficult
LM a | Numeric | Logical Memory A (Min=0; Max (best score)=23)
LM b | Numeric | Logical Memory B (Min=0; Max (best score)=22)
LM tot | Numeric | Logical Memory total (A + B/2)
LM a Cued | Numeric | Logical Memory A, cued (Min=0; Max (best score)=23)
LM b Cued | Numeric | Logical Memory B, cued (Min=0; Max (best score)=22)
LM a Interf | Numeric | Logical Memory (with interference) A (Min=0; Max (best score)=23)
LM b Interf | Numeric | Logical Memory (with interference) B (Min=0; Max (best score)=22)
LM tot Interf | Numeric | Logical Memory (with interference) total [A + B/2]
LM a Interf Cued | Numeric | Logical Memory (with interference) A, cued (Min=0; Max (best score)=23)
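Several entries in Table A.1 are composites of other subtests. As a minimal sketch, the derived scores could be computed from the raw values as below; the dictionary keys are hypothetical stand-ins for the feature names, and the formulas follow the table descriptions literally (in particular, LM tot is read as A + B/2, as written).

```python
def derived_scores(raw):
    """Compute the composite scores described in Table A.1 from raw subtest values.

    `raw` is a dict with hypothetical keys mirroring the feature names."""
    return {
        "As_tot": raw["As_cut"] / raw["As_time"] * 10,  # cancellation: cuts / time * 10
        "DS_tot": raw["DS_forw"] + raw["DS_back"],      # Digit Span: forward + backward
        "PA_Tot": raw["PA_Easy"] / 2 + raw["PA_Dif"],   # paired-associate: easy/2 + difficult
        "LM_tot": raw["LM_a"] + raw["LM_b"] / 2,        # Logical Memory: A + B/2, as written
    }
```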
Table A.2: Feature list, part 2
Feature | Type | Description
LM a Interf Cued | Numeric | Logical Memory (with interference) A, cued (Min=0; Max (best score)=23)
LM b Interf Cued | Numeric | Logical Memory (with interference) B, cued (Min=0; Max (best score)=22)
MVI Free | Numeric | Word Recall with Interference, free (Min=0; Max (best score)=15)
MVI Cued | Numeric | Word Recall with Interference, cued (Min=0; Max (best score)=10)
MVI Rec | Numeric | Word Recall with Interference, recognition (Min=0; Max (best score)=5)
MVI Tot | Numeric | Word Recall with Interference, total (free + cued + recognition)
Infor | Numeric | Test about general information (Min=0; Max (best score)=20)
VisualM A | Numeric | Visual Memory, image A (Min=0; Max (best score)=3)
VisualM B | Numeric | Visual Memory, image B (Min=0; Max (best score)=5)
VisualM C1 | Numeric | Visual Memory, image C1 (Min=0; Max (best score)=3)
VisualM C2 | Numeric | Visual Memory, image C2 (Min=0; Max (best score)=4)
VisualM total | Numeric | Visual Memory total (A + B + C1 + C2)
Or Total | Numeric | Orientation total (personal + spatial + temporal)
Orient P | Numeric | Orientation, personal (Min=0; Max (best score)=5)
Orient S | Numeric | Orientation, spatial (Min=0; Max (best score)=3)
Orient T | Numeric | Orientation, temporal (Min=0; Max (best score)=7)
Fluency Sem | Numeric | Verbal fluency (higher score is better)
Fluency Phon | Numeric | Phonologic fluency (higher score is better)
M Initiative | Numeric | Motor initiative (Min=0; Max (best score)=3)
Gm Initiative | Numeric | Graphomotor initiative (Min=0; Max (best score)=2)
Writing | Numeric | Writing (Min=0; Max (best score)=2)
Comp | Numeric | Orders comprehension (Min=0; Max (best score)=4)
Ident | Numeric | Objects identification (Min=0; Max (best score)=5)
Token T | Numeric | Token orders, total (Min=0; Max (best score)=17)
Naming | Numeric | Naming (Min=0; Max (best score)=7)
Repetition | Numeric | Repetition (Min=0; Max (best score)=11)
Token Complete | Numeric | Complete version of the Token test (Min=0; Max (best score)=22)
Snodgrass missing | Numeric | Snodgrass and Vanderwart, number of missing words
Snodgrass end | String | Snodgrass and Vanderwart, total number of presented words
Public Faces missing | Numeric | Public Faces, number of missing words
Public Faces end | String | Public Faces, total number of presented words
Prxs | Numeric | Motor coordination (Min=0; Max (best score)=12)
Cube | Numeric | Drawing of a cube (Min=0; Max (best score)=3)
Clock | Numeric | Drawing of a clock (Min=0; Max (best score)=3)
Table A.3: Feature list, part 3
Feature | Type | Description
Calc | Numeric | Calculation (Min=0; Max (best score)=14)
M Calc | Numeric | Mental calculation (Min=0; Max (best score)=11)
MPR | Numeric | Raven Progressive Matrices (Min=0; Max (best score)=12)
Proverb | Numeric | Verbal abstraction (Min=0; Max (best score)=9)
Stroop 1 | Numeric | Stroop, reading (max=100; higher score is better)
Stroop 2 | Numeric | Stroop, color naming (max=100; higher score is better)
Stroop 3 | Numeric | Stroop, word interference (max=100; higher score is better)
JLO | Numeric | Judgment of Line Orientation, correct answers (higher score is better)
Facial recognition | Numeric | Facial Recognition Test Record Form: Normal (41-54); Borderline (39-40); Moderately impaired (37-38); Severely impaired (<37)
WAIS cubos | Numeric | Wechsler Adult Intelligence Scale, cubes (0-52; higher score is better)
WAIS semelhancas | String | Wechsler Adult Intelligence Scale, similarities (0-26; higher score is better)
WAIS vocabulario | String | Wechsler Adult Intelligence Scale, vocabulary (0-80; higher score is better)
WAIS codigo | String | Wechsler Adult Intelligence Scale, symbol search (0-90; higher score is better)
WAIS lacunas | Numeric | Wechsler Adult Intelligence Scale, picture completion (0-21; higher score is better)
MMSE | Numeric | Mini-Mental State Examination (Min=0; Max (best score)=30)
TPRT | Numeric | Toulouse-Pieron (work output = number correct - (omissions + errors); higher score is better)
TPID | Numeric | Toulouse-Pieron (dispersion index = (omissions + errors)/correct * 100; lower score is better)
TMT A temp | Numeric | Trail Making Test (Part A), time (in seconds; the test is usually interrupted past 180 s; less time is better)
TMT A err | Numeric | Trail Making Test (Part A), errors (no maximum; fewer is better)
TMT B temp | Numeric | Trail Making Test (Part B), time (in seconds; the test is usually interrupted past 300 s; less time is better)
TMT B err | Numeric | Trail Making Test (Part B), errors (no maximum; fewer is better)
TMT B incomplete | Numeric |
a1 | Numeric | CVLT, List A, 1st recall (Min=0; Max (best score)=16)
a2 | Numeric | CVLT, List A, 2nd recall (Min=0; Max (best score)=16)
a3 | Numeric | CVLT, List A, 3rd recall (Min=0; Max (best score)=16)
a4 | Numeric | CVLT, List A, 4th recall (Min=0; Max (best score)=16)
a5 | Numeric | CVLT, List A, 5th recall (Min=0; Max (best score)=16)
Table A.4: Feature list, part 4
Feature | Type | Description
a1a5 | Numeric | CVLT, List A 1-5 total (sum of the 5 recalls; Min=0; Max (best score)=80)
a pers | Numeric | CVLT, List A perseverations (sum of word repetitions over the 5 recalls; no maximum; lower is better)
a intr | Numeric | CVLT, List A intrusions (sum of new words added to the list over the 5 recalls; no maximum; lower is better)
b tot | Numeric | CVLT, List B (Min=0; Max (best score)=16)
b pers | Numeric | CVLT, List B perseverations
b intr | Numeric | CVLT, List B intrusions
b cs | Numeric | CVLT, List B semantic clustering (CS; number of groupings of words from the same category)
a cr int | Numeric | CVLT, List A, free recall after a short delay (Min=0; Max (best score)=16)
a crint pers | Numeric | CVLT, short-delay free recall, perseverations
a crint intr | Numeric | CVLT, short-delay free recall, intrusions
a crint cs | Numeric | CVLT, short-delay free recall, CS
a crint ajsem | Numeric | CVLT, recall after a short delay with semantic cue (Min=0; Max (best score)=16)
a crint ajsem pers | Numeric | CVLT, short-delay cued recall, perseverations
a crint ajsem intr | Numeric | CVLT, short-delay cued recall, intrusions
a lg int | Numeric | CVLT, List A, recall after a long delay (Min=0; Max (best score)=16)
a lgint pers | Numeric | CVLT, long-delay free recall, perseverations
a lgint intr | Numeric | CVLT, long-delay free recall, intrusions
a lgint cs | Numeric | CVLT, long-delay free recall, CS
a lgint ajsem | Numeric | CVLT, recall after a long delay with semantic cue (Min=0; Max (best score)=16)
a lgint ajsem pers | Numeric | CVLT, long-delay cued recall, perseverations
a lgint ajsem intr | Numeric | CVLT, long-delay cued recall, intrusions
rec a | Numeric | CVLT, recognition after a long delay, List A (Min=0; Max (best score)=16)
rec Bp | Numeric | CVLT, recognition, List B shared (Min=0 (best score); Max=4)
rec Bn | Numeric | CVLT, recognition, List B not shared (Min=0 (best score); Max=4)
rec P | Numeric | CVLT, recognition, prototype (Min=0 (best score); Max=4)
rec sr | Numeric | CVLT, recognition, unrelated (Min=0 (best score); Max=16)
GDS | Numeric | Geriatric Depression Scale (Min=0 (best score); Max=15)
QSM Total | Numeric | Subjective Memory Complaints scale (Min=0 (best score); Max=22)
BlessedAVD | Numeric | Blessed, total of Part 1, daily living activities (Min=0 (best score); Max=8)
BlessedHAB | Numeric | Blessed, total of Part 2, habits (Min=0; Max=9)
BlessedPERS | Numeric | Blessed, total of Part 3, personality (Min=0; Max=11)
Table A.5: Feature list, part 5
Feature | Type | Description
BlessedTOT | Numeric | Blessed total (Part 1 daily living activities + Part 2 habits + Part 3 personality)
All of the following features are Numeric Z scores of the corresponding tests:
CancellationTask Z, DigitSpan Z, DigitSpan forward Z, DigitSpan backward Z, SemanticFluency Z, MotorInitiative Z, GraphomotorInitiative Z, Comprehension Z, Identification Z, Token Z, Naming Z, Repetition Z, Writing Z, Orientation Z, WordRecall Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, LogicalMemory Z, LogicalMemory A Z, LM DR Z, VisualMemory Z, Cube Z, Clock Z, CubesWAIS Z, Calculation Z, MPR Z, Proverbs Z, TP RT Z, TP ID Z, TMT A Z, TMT B Z, A1 Z, A5 Z, Atot Z, B Z, SDFR Z, SDCR Z, LDFR Z, LDCR Z, REC Z, Token Complete Z
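The Z-score features above are standardized versions of the raw test scores. Assuming they are computed against a normative mean and standard deviation for each test (the norms themselves are not given in this appendix), the transformation is the standard one, sketched below.

```python
def z_score(raw, norm_mean, norm_sd):
    """Standardize a raw test score against a normative mean and standard deviation.

    Positive values are above the norm. The normative values used in the thesis
    are not reproduced here; the arguments are placeholders."""
    if norm_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - norm_mean) / norm_sd
```

For example, a raw score of 25 against a norm of mean 27 and standard deviation 2 yields a Z score of -1.0.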
Appendix B: Diagnosis
Table B.1: Selected features for diagnosis, using correlation and mRMR.
Correlation: As cut, As tot, LM a Cued, LM a Interf, Infor, Or Total, Orient P, Orient S, Orient T, Fluency Sem, Writing, Naming, Cube, Clock, MPR, BlessedAVD, BlessedTOT, CancellationTask Z, DigitSpan Z, DigitSpan backward Z, SemanticFluency Z, GraphomotorInitiative Z, Orientation Z, WordRecall Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, LogicalMemory A Z, Clock Z, Calculation Z, MPR Z, Proverbs Z, Atot Z
mRMR: rec a, Proverb, Writing, PA Inter Dif, Naming, a crint ajsem pers, a crint pers, DS back, BlessedHAB, MPR, Comp, Orient S, M Calc, Orient T, a lgint cs, Gm Initiative, BlessedAVD, MVI Free, Writing Z, Clock, Naming Z, Orient P
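The mRMR column follows the minimum-redundancy maximum-relevance criterion of Peng et al. [43], which greedily picks features with high mutual information with the class and low mutual information with features already selected. A minimal sketch for discrete features is shown below; it is a toy stand-in for illustration, not the implementation used in this work.

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr(features, labels, k):
    """Greedy mRMR: score = MI(feature, class) - mean MI(feature, already selected)."""
    selected = []
    while len(selected) < k:
        best, best_score = None, -float("inf")
        for name, values in features.items():
            if name in selected:
                continue
            relevance = mutual_info(values, labels)
            redundancy = (sum(mutual_info(values, features[s]) for s in selected)
                          / len(selected)) if selected else 0.0
            if relevance - redundancy > best_score:
                best, best_score = name, relevance - redundancy
        if best is None:  # fewer than k features available
            break
        selected.append(best)
    return selected
```

On a toy dataset, a feature identical to the class labels is selected before one that is independent of them.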
Appendix C: Prognosis
C.1 First and Last Evaluation
[Figure: six bar-chart panels, one per classifier (KNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF, SVM Poly), each plotting |TPR-FPR| on a 0-1 scale.]
Figure C.1: Train results of Prognosis using First and Last Evaluations
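The quantity on the y-axis of these plots, |TPR - FPR|, is the magnitude of the gap between the true-positive rate (sensitivity) and the false-positive rate, i.e. the absolute value of the informedness/Youden's J statistic discussed by Powers [45]: 0 indicates chance-level behaviour and 1 a perfect separation. A minimal sketch of computing it from binary predictions:

```python
def tpr_fpr_gap(y_true, y_pred, positive=1):
    """Absolute difference between true-positive rate and false-positive rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # fall-out
    return abs(tpr - fpr)
```

For instance, a classifier with sensitivity 0.75 and false-positive rate 0.25 scores 0.5 on this metric.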
Table C.1: Selected features for the prognosis using the first and last evaluation. The techniques used are correlation [32] and mRMR [43].
Correlation: Age, As time, As tot, PA Dif, LM a, LM b, LM tot, LM a Interf, MVI Free, Or Total, Orient T, Cube, MPR, a1a5, a lg int, LogicalMemory Z, LogicalMemory A Z, Proverbs Z, TMT B Z, Atot Z
mRMR: VisualM B, Clock, As cut, Orient P, b cs, LM b Interf Cued, Gender, a lg int, TMT B incomplete, Gm Initiative, MVI Free, a lgint ajsem pers, Repetition, TMT B Z, Orient S, Naming, Cube, a lgint cs, Orient T, Writing, Calc, BlessedHAB, a lgint pers, PA Inter Dif, Comp, Token T, Prxs, Comprehension Z, TP RT Z
C.2 Temporal window: Two years
C.3 Temporal window: Three years
C.4 Temporal window: Four years
Table C.2: Selected features for the prognosis using the two-year temporal window. The techniques used are correlation [32] and mRMR [43].
Correlation: Age, PA Easy, PA Tot, LM a, LM a Interf, MVI Free, Or Total, Fluency Sem, Calc, CancellationTask Z, Orientation Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, Cube Z, MPR Z, Atot Z
mRMR: Writing, Orient P, DS back, TMT A err, a lgint ajsem, Naming, Or Total, Comp, MVI Free, Calc, a crint pers, Cube, TMT B incomplete, b cs, Orient S, PA Inter Dif, BlessedHAB, M Calc, TMT B temp
[Figure: six bar-chart panels, one per classifier (KNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF, SVM Poly), each plotting |TPR-FPR| on a 0-1 scale.]
Figure C.2: Train results of Prognosis using two years temporal window
Table C.3: Selected features for the prognosis using the three-year temporal window. The techniques used are correlation [32] and mRMR [43].
Correlation: Age, DS back, PA Easy, PA Tot, LM a Cued, LM a Interf, LM tot Interf, MVI Free, Orient T, Fluency Sem, Calc, a2, CancellationTask Z, DigitSpan Z, Orientation Z, WordRecall Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, LogicalMemory A Z, Cube Z, MPR Z
mRMR: DS back, a crint ajsem intr, Comp, a3, Naming, b cs, Or Total, Cube, MVI Free, a crint pers, M Calc, BlessedHAB, PA Inter Dif, a crint cs, TMT B incomplete, Writing, TPID
[Figure: six bar-chart panels, one per classifier (KNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF, SVM Poly), each plotting |TPR-FPR| on a 0-1 scale.]
Figure C.3: Train results of Prognosis using three years temporal window
Table C.4: Selected features for the prognosis using the four-year temporal window. The techniques used are correlation [32] and mRMR [43].
Correlation: PA Tot, LM a, LM a Cued, LM a Interf, LM tot Interf, MVI Free, Infor, Orient T, Fluency Sem, Cube, BlessedAVD, SemanticFluency Z, Orientation Z, WordRecall Z, VerbalPaired AssociateLearning Z, MPR Z, A5 Z
mRMR: Orient P, Fluency Phon, LDCR Z, Writing Z, Comp, b cs, LM b Interf Cued, MVI Free, TMT A err, Orient T, Naming, Cube, a crint cs, M Calc, a lgint pers, PA Inter Dif, TMT B incomplete, BlessedHAB, TPRT
[Figure: six bar-chart panels, one per classifier (KNN, Neural Network, Decision Tree C4.5, Naive Bayes, SVM RBF, SVM Poly), each plotting |TPR-FPR| on a 0-1 scale.]
Figure C.4: Train results of Prognosis using four years temporal window