exit*cbr: a framework for case-based medical diagnosis development and experimentation

Artificial Intelligence in Medicine 51 (2011) 81–91

Contents lists available at ScienceDirect

Artificial Intelligence in Medicine

journa l homepage: www.e lsev ier .com/ locate /a i im

eXiT*CBR: A framework for case-based medical diagnosis developmentand experimentation

Beatriz Lopez a,b,*, Carles Pous a,b, Pablo Gay a, Albert Pla a, Judith Sanz c, Joan Brunet b,d

a Control Engineering and Intelligent Systems Research Group, Universitat de Girona, Campus Montilivi, edifice P4, 17071 Girona, Spainb Girona Biomedical Research Institute, Av. de Franca s/n, 17007 Girona, Spainc Unitat de Consell Genetic en Cancer Hereditari, Servei d’Oncologia Medica, Hospital de la Santa Creu i Sant Pau, Sant Antoni Maria Claret, 167, 08025 Barcelona, Spaind Medical Oncology Department, Catalan Institute of Oncology, Av. Franca s/n, 17007 Girona, Spain

A R T I C L E I N F O

Article history:

Received 15 December 2008

Received in revised form 30 April 2010

Accepted 12 May 2010

Keywords:

Case-based reasoning

Experimentation supporting tool

Development supporting tool

Medical diagnosis

Breast cancer diagnosis

A B S T R A C T

Objective: Medical applications have special features (interpretation of results in medical metrics,

experiment reproducibility and dealing with complex data) that require the development of particular

tools. The eXiT*CBR framework is proposed to support the development of and experimentation with

new case-based reasoning (CBR) systems for medical diagnosis.

Method: Our framework offers a modular, heterogeneous environment that combines different CBR

techniques for different application requirements. The graphical user interface allows easy navigation

through a set of experiments that are pre-visualized as plots (receiver operator characteristics (ROC) and

accuracy curves). This user-friendly navigation allows easy analysis and replication of experiments. Used

as a plug-in on the same interface, eXiT*CBR can work with any data mining technique such as

determining feature relevance.

Results: The results show that eXiT*CBR is a user-friendly tool that facilitates medical users to utilize CBR

methods to determine diagnoses in the field of breast cancer, dealing with different patterns implicit in

the data.

Conclusions: Although several tools have been developed to facilitate the rapid construction of

prototypes, none of them has taken into account the particularities of medical applications as an

appropriate interface to medical users. eXiT*CBR aims to fill this gap. It uses CBR methods and common

medical visualization tools, such as ROC plots, that facilitate the interpretation of the results. The

navigation capabilities of this tool allow the tuning of different CBR parameters using experimental

results. In addition, the tool allows experiment reproducibility.

� 2010 Elsevier B.V. All rights reserved.

1. Introduction

Case-based reasoning (CBR) has recently evolved, and aconsiderable number of techniques are currently available foreach stage of CBR system development, although they are unevenlydistributed. Each particular domain requires that appropriatetechniques be selected for each phase and the parameters involvedbe appropriately tuned. Several methods have been designed toguide CBR system development [1,2], and some of these areaccompanied by software tools that allow building prototypesquickly, which can then be progressively corrected [3–5]. Somegeneral purpose tools such as cF [6], CBR* [3], CBR Shell [4] and

* Corresponding author at: Universitat de Girona, Campus Montilivi, edifice P4,

17071 Girona, Spain. Tel.: +34 972 418 880; fax: +34 972 418 259.

E-mail addresses: [email protected] (B. Lopez), [email protected]

(C. Pous), [email protected] (P. Gay), [email protected] (A. Pla),

[email protected] (J. Sanz), [email protected] (J. Brunet).

0933-3657/$ – see front matter � 2010 Elsevier B.V. All rights reserved.

doi:10.1016/j.artmed.2010.09.002

jCOLIBRI [7] are available and can be used to develop a cancerdiagnosis system as well as an electrical fault diagnosis system.

Although there should always be a compromise betweengenerality and specificity when designing software tools, medicalapplications have particular requirements that justify the devel-opment of a specific CBR platform. In [8], for example, the authorshighlight the need to study what is common to all successfulapplications of CBR systems and to search for common representa-tions of CBR data and methods in the medical field. Other studieshave looked at the special cases of CBR [9,10] and data mining [11]in the medical domain. In [10] we describe our experience ofworking together with several teams of medical users, and weagree with the problems presented in [8] and [11]. From thatanalysis, we detected the need to deploy a particular tool for CBR inmedicine that can cope with the specific challenges identified:attaining a probabilistic interpretation of CBR, improving theinterface to interpret the medical results, and the hybridization ofCBR with other techniques (as for example fuzzy systems [18]).

http://dx.doi.org/10.1016/j.artmed.2010.09.002

mailto:[email protected]






http://www.sciencedirect.com/science/journal/09333657

http://dx.doi.org/10.1016/j.artmed.2010.09.002

B. Lopez et al. / Artificial Intelligence in Medicine 51 (2011) 81–9182

First, a way to provide a probabilistic interpretation of CBR isclosely related to the kind of visualization tools medical usersusually work with: receiver–operator characteristic (ROC) and costcurves. ROC curves depict the tradeoff between hit rate and falsealarm rate [12,13]. In cost curves, classifier performance is shownthrough the class distribution changes [14]. Although not CBRtechniques per se, the incorporation of these visualization tools in aCBR environment should facilitate the use of CBR in medicaldomains.

Second, the use of ROC curves can also improve the interfacewith medical users. Moreover, result reproducibility (the sameresults obtained by different teams given the same method andidentical test material) should be guaranteed to offer medical usersconfidence. And since repeating experiments is often critical andnecessary to obtain the parameters that provide the best results ina CBR system, special care should be taken to adequately store allthe data used when the system is tested so they are available to theresearch community.

And third, the complementarity of other methods andtechniques for feature learning and for dealing with datadependencies [8] should be contemplated, so that CBR can behybridized with data mining and reasoning tools.

To our knowledge, no current tool supports all these features, inparticular the first two. The final CBR performance depends on theskills of the engineer choosing the corresponding parametersguided by the interpretation of the results obtained so far. Unlesswe provide the appropriate visualization tools for result interpre-tation and reproduction, it will be difficult to adequately tune theCBR parameters.

In this work a specific CBR tool for medical applications ispresented to support CBR system development, replicate experi-ments and interpret results. The framework, eXiT*CBR1 is amodular environment that combines different CBR techniques fordifferent application requirements. The graphical user interfaceallows easy navigation through a set of experiments that are pre-visualized as plots (ROC and accuracy curves). This allows aknowledge engineer and a medical user working side by side to testdifferent CBR parameters and visualize the results in a user-friendly way. Experiment replication is assured, since experimentsare managed automatically by the system. The method is based onkeeping datasets, configuration files of the experiments, and anydata involved in the experiment (as attribute weights).

Finally, any data mining method can be plugged into the systemas a pre-processing step to treat the medical data, according to thecurrent trends in knowledge transfer [15]. Thus, both theknowledge engineer and the medical user have a single interfaceto deal with complex data. eXIT*CBR is based on CBR as a decision-making method, although it can be complemented by any othertechnique (e.g. attribute weight learning method [19]).

Therefore, the eXiT*CBR platform goes beyond the rapidprototyping of medical CBR to offer an extensible architecturefor overall experimentation, and to support experiment replicationand interpretation via common medical graphics tools. eXiT*CBRdoes not build a new system from scratch, but rather integratesexisting CBR methods with new ones as required, and facilitatesthe integration of data mining techniques.

This article is organized as follows. First, we introduce theeXiT*CBR framework. Next, we briefly describe the eXiT*CBRfunctionalities. Then, an example is used to describe the main stepsrequired to develop a CBR application with eXiT*CBR, obtainresults, and have them evaluated by the medical users. Finally, wecompare our work with several previous studies and end the paperwith some conclusions.

1 eXiT stands for the acronym of our research group.

2. The eXiT*CBR framework

The eXiT*CBR framework was designed to bring together CBRmethods currently used in medical applications and data miningand visualization techniques that can be plugged into it. eXiT*CBRalso facilitates the incorporation of new techniques, if required.Our architecture follows a modular approach based on thedifferent phases in a CBR system.

For a reader unfamiliar with CBR methods, a basic CBR systemrelies on a case base in which past experiences are stored. Forexample, in a breast cancer diagnosis, a case base might containinformation about patients who have been diagnosed with theillness as well as information about healthy people. A CBR systemuses the case base to diagnose future cases in four steps: retrieve,reuse, revise and retain [16]. In the retrieve step, the current case iscompared with all the past experiences in the case base, and themost similar are recovered. Next, in the reuse step, a solution(diagnosis) to the current case is determined based on thesolutions found in the retrieved cases. Third, the computedsolution is evaluated in the revise step. Finally, the retain stepanalyzes whether to keep the case in the case base.

eXiT*CBR follows a modular approach: the different CBR stepsare implemented as generic classes. When a new method isrequired but not provided in the system, it can be assembled as aparticular instance of a generic class. Thus, when users want to addnew CBR methods, they only need to instantiate the generic classesdefined for this purpose. However, it is important to distinguishbetween CBR methods and methods that involve either theintegration of data mining tools or hybridization with otherartificial intelligence (AI) techniques (as for example, a fuzzysystem [18] in the retrieve step). When other techniques need to beintegrated or hybridized, the corresponding executable codesshould also be included. This approach gives the systemmodularity, reusability and extensibility.

As our framework goes beyond pure CBR prototyping and aimsto support experimentation, in addition to the basic CBR modules,five other elements are required: the experiment interpreter core,the CBR engine, the pre-process element, the post-process elementand the experiment navigator. As experiments are essential toeXiT*CBR, the architecture is organized into different componentsaround the experiment interpreter core (see Fig. 1). Case represen-tation and the different components are explained in theremainder of this section.

eXiT*CBR has been developed using the Java language and theJFreeChart library.2 It is compatible with the Linux and Windowsoperating system platforms. The software allows users to select alanguage to work in including English and Catalan. We are in theprocess of making it possible to download the platform for freefrom our web page. In the meantime, however, any interestedresearcher can request a copy from the authors.

2.1. Case representation

eXiT*CBR always uses plain csv file to handle data. Each rowcorresponds to a case, and each column to attributes of the cases.One of the attributes should be marked as the one containing thediagnosis information (class).

The first four rows describe the attributes as follows:

� T

si

he first row corresponds to the attribute descriptions (forexample, ‘‘age when first child was born’’)
� T he second row corresponds to the attribute name (usually in a
cryptic form, as for example, ‘‘age1stborn’’).

2 In fact, medical users usually collect data in Microsoft Excel or SPSS files, with a

milar format.

[(Fig._1)TD$FIG]

Fig. 1. eXiT*CBR framework.

B. Lopez et al. / Artificial Intelligence in Medicine 51 (2011) 81–91 83

� T
he third row corresponds to the attribute type (�1 ignore, 0discrete, 1 numerical, 2 textual, 3 date) � T he fourth row corresponds to the attribute weight (relevance).
We believe that this simple, plain representation covers most ofthe data used in medical applications3 and is easy to manage andgeneral enough to be used by any of the current CBR techniques(mainly distance functions).

From the original data file in csv format, and using, for example,the facility described in Section 3, several datasets can begenerated for different experimental methods. All of the datasetsemploy the same case representation.

2.2. Experiment interpreter: the eXiT*CBR core

The core of the framework is the experiment interpreter. Aconfiguration file that contains all the information required toreplicate an experiment is defined. The eXiT*CBR core interpretsthe configuration file, and applies the corresponding methods, orcalls the corresponding eXiT*CBR components: first, the prepro-cess, then the CBR engine, then, the post-process. Only theexperiment navigator interacts with the user.

The different parameters involved in an experiment depend onthe method employed. Currently, there are two main methods:batch processing and cross-validation.

In eXiT*CBR batch processing means performing a single run ofa CBR in a case base (see Fig. 2, top). The input to this kind ofexperiment is a single dataset with two files: the training and thetest files. The experiment interpreter core takes the training data togenerate the case base, and uses the test file to obtain the results ofthe CBR system defined in the configuration file. These results aregiven in the form of CBR performance according to experimentalmeasures selected by the user. Note that the solutions of the testcases are known in advance, and are used to verify that the CBRsystem, configured as it is, works appropriately.

Stratified cross-validation processing implies multiple runs of aCBR configuration with different datasets (see Fig. 2, bottom). Inputdata in this kind of experiment are k datasets, ‘‘set1’’, ... ‘‘setk’’, eachcomposed of at least one training file and one test file. Thus, theexperiment interpreter core takes the training data of a given set‘‘set1’’ as the case base, and uses the test data to obtain the results of

3 The details of the file contents are described in Section 4.

the CBR system. Then, the performance of the CBR system with allof the datasets is averaged.

eXiT*CBR also allows a multi-agent system (MAS, [17]) methodin which several CBR systems cooperate to obtain a diagnosis.Thus, for example, an experiment could be MAS-cross-validationor MAS-batch. For the sake of simplicity, in this paper we focus onthe single (non-MAS) methods.

The data to be used, the pre-processing techniques, the CBRmethods, and how the results are post-processed are defined in theconfiguration file. First, batch processing requires that a singledataset be specified, while cross-validation requires a set of them.Other information related to the data is the name of the attributethat contains the solution of the case (i.e. class information sincewe are interested in case-based medical diagnosis). Second, pre-processing methods, such as the discretization of numericalattribute values, the normalization of numerical attribute values,or the filtering out of a subset of case attributes, can be defined.Third, the techniques to be used in each of the four main phases ofCBR should be specified. Both CBR methods and other AItechniques can be included. For example, in Fig. 3 the retrieve_-

selection method employed is MajorityK with the parameter k = 5but a fuzzy system that determines case similarities can be alsoused (for example see [18]).4 Combining different AI techniques inhybrid systems seems to tackle the complexity and multi-facetedaspects of medical data. And fourth, post-processing depends onthe environment where the CBR system is designed to work. If thesystem is developed to cooperate with other systems to solve agiven problem (on-line mode), the output of the CBR systemshould go through a communication port; however, if the purposeof the CBR system is to obtain results in isolation (off-line mode),visualization tools could be more useful. In both cases, severalperformance measures can be specified to show the results to theuser.

When an experiment is run, all the files associated with it arestored in a folder with the name of the method used to conduct theexperiment (cross-validation or batch), and the day and the time ittook place. For example, in Fig. 5 the experiment folder with thename ‘‘CrossValidation-CBR-20080516-124248’’ means that thisfolder contains the files belonging to an experiment carried out at12:42:48 PM, on 16 May 2008, using cross-validation as theexperimental method. This label ensures that repeating the same

4 These are cases in medical terms, but in order to avoid confusion with the CBR

terminology we prefer to use the term ‘‘patient’’.

[(Fig._2)TD$FIG]

Fig. 2. The two main experimental methods in eXiT*CBR.


experiment with the same data a couple of seconds later does notoverwrite the previous experiment results, even though the sameexperiment file name is used. The information stored in the resultfolder is used by the post-processor component to plot the resultsof the current experiment for the user, and by the navigator tool tocompare different experiments.

2.3. The pre-process element

The first component of the framework that the experimentinterpreter core calls to perform an experiment is the pre-processelement, which applies the description of the discretization,normalization and feature selection slots of the configuration file.Discretization methods convert numerical data into categoricaldata [19]; normalization methods assure that all the numericalvalues are comparable in the [0,1] interval; and feature selectionmethods determine which of the available features (attributes) arerelevant for the application. Relevant and irrelevant features can beassigned and labeled with high and low weights, respectively[20,19].

After the pre-process is completed, the experiment core takesback control and passes it to the CBR engine.

2.4. The CBR engine

All the information required to set up a CBR system according touser requirements is stored in the configuration file. The CBRengine is responsible for reading this file, extracting selectedmethods and parameters and, finally, calling and executing therelated algorithms. At the end of this process a set of files isobtained as output. These files are used by the post-processelement of the architecture to display the results in the format theuser selects.

The following paragraphs describe the modules that the CBRengine calls when launched.

2.4.1. Retrieve module

The retrieve module compares the cases with a given test case,and selects the most similar cases from the case base. Three keymethods are involved in this process:

� T
he distance method or similarity measure employed to comparecases � T he methods employed to handle missing information when the
similarity measures are applied
� T he selection procedure to determine the most similar cases.
All these methods are defined in a generic way and can beinstantiated as required or as they become available (as depicted inFig. 4).

Similarity measures can be local or global (Fig. 4). Localsimilarity measures compare two attribute values. There can be asmany kinds of local similarity measures as there are operandsavailable. For example, the Euclidean distance is the measureproposed most often to handle numerical data, while the Hammingdistance is set for categorical (discrete) data [21]. Local similaritymeasures regarding data trees (to handle inheritance information,for example), series, and dates can also be used. Global similaritymeasures combine different local similarity outcomes to deter-mine the similarity between two cases. Weighted average is one ofmany examples [22].

Local distance methods have to be able to deal with missingvalues. They are cost sensitive and used widely in medicalapplications. According to [23], they can be classified as missingcompletely at random (MCAR), missing at random (MAR), or notmissing at random (NMAR). MCAR is used when the probability ofmissing a value is the same for all attributes; MAR is used when theprobability of missing a value is only dependent on anotherattribute; and NMAR is used when the probability of missing avalue is also dependent on the value of the missing attribute. Anyof these methods can be implemented and added to eXiT*CBR.

Finally, the selection method must be chosen. Possibilitiesinclude selecting the k-nearest neighbors or selecting the caseswith a similarity degree higher than a pre-fixed threshold.

As output of the retrieve phase, the CBR engine creates a filecontaining the distance matrix that depicts the similitude betweeneach test case and the cases in the case base.

2.4.2. Reuse module

In the reuse module a method for adapting the solutions of theretrieved cases to a given test case needs to be specified. This phase

[(Fig._3)TD$FIG]

######################################

### Configuration File v 2.0 ####

######################################

# Data:

########

Directory = C:/Results

Name = Original

Data = /Datasets10

Class = A1_1_1#8

#############

# Pre-process:

#############

Discretization = EqualWidthIntervalBinning;B2_6#42/5.0

Normalization = MaxMin

AttributeSelection = null

#############

# CBR Model:

############

Retrieve

Retrieve_Unknown = SimpleAll

Retrieve_Num_Local_Distance = Euclidean

Retrieve_Cat_Local_Distance = Hamming

Retrieve_Global_Distance = Average

Retrieve_Selection = MajorityK;5

Reuse = ReuseProbabilisticPous;[0:0.1:1],0.7

Revise = null

Retain = Never

#################

# Experimentation:

#################

Measures = truePositiveFalsePositiveRates

Visualization = ROC

Method = CrossValidation

##############

# Post-process:

##############

Post-process = Off_line

### End

Fig. 3. Configuration file.


constitutes one of the main limitations of CBR in medical diagnosis[24]. The majority of medical CBR systems suggest past solutionswithout an additional adaptation process and the Bilska–Wolakmethod [25] proposes a probabilistic approach. However, thissecond method has only been applied with a low number offeatures, and much more research is needed to deploy it in real
[(Fig._4)TD$FIG]
Fig. 4. Retrieve

environments. eXiT*CBR allows users to create and test their ownreuse algorithm easily.

2.4.3. Revise module

Most of the current medical CBR systems rely on humanfeedback in the revise phase. There are currently no otheralternatives. However, in other environments, simulators are alsopossible [26].

In the experiments used with eXiT*CBR, namely, batch andcross-validation, the case solution is known in advance. Thus, twomain results are generated:

� T

m

he classification matrix, which contains information about thereal classification of the cases and the prediction given by thesystem (as output of the reuse phase)
� T he confusion matrix, which contains information about true
positive, true negative, false positive and false negative rates.

2.4.4. Retain module

Once the solution proposed for the new case is revised, adecision has to be made about whether to retain the new case ornot. This decision will depend on whether or not the cases alreadyin the case base produce a correct solution or not. Since the CBRsystem’s core is a case base that can be very large, it has a lot incommon with the data mining methods for data processing. Whenthe case base grows, a good maintenance policy is necessary. Thismeans that we have to delete, add or modify cases to keep thesystem performing well. Instance-based learning algorithms suchas IB3 [27] or DROP4 [28] can be applied in this stage to decidewhether to keep the new case, forget it, or keep it and delete othercases.

Adding cases automatically to a medical domain without anyhuman control can be somewhat dangerous, since the informationabout a particular case or situation can be increased, leading to alarge bias in the system. An alternative is to store new casestemporarily until a medical user checks them. CBRworks [29] hasimplemented this approach. In eXiT*CBR we follow this conserva-tive view of the development of CBR systems, and we leave thestudy of the integration of algorithms such as IB3 and DROP4 forthe future.

2.5. The post-process element

The task of the post-process element starts when the CBRengine finishes. The post-process component of eXiT*CBR is relatedto the type of measures the user is interested in (confusion matrix,success rate, specificity, etc.) and how to visualize them. Inaddition, it processes the information according to the workingmode selected by the user: on-line or off-line. More details aregiven in the following subsections.

2.5.1. Visualization results

In the configuration file, the user chooses, in advance, themethod and measures to apply to the current experiment, as well

odules.

[(Fig._5)TD$FIG]

Fig. 5. File structure.


as how to view the results. Then, the post-process componentproduces the corresponding plot from the information provided bythe CBR engine (confusion matrices). A Figure.png, generated asoutput, depicts the corresponding plot derived from the experi-ment.

In the current implementation the experiment navigator tool

uses the confusion matrix and the plots to help the user choose andcompare experiments (see next section).

By default eXiT*CBR visualizes the results with ROC curves (likethe one in Fig. 10). However, there are certain situations in whichvisually comparing ROC curves is not the best way to select the bestdiagnostic system; in these cases using an area under curve (AUC)value can help make the choice. The larger the AUC value, thebetter the performance of the diagnostic system. Therefore, theROC figures are complemented with AUC information. Of course,the ROC curves and the AUC values have to be checked by medicalusers to certify that these results agree with their medical criteria.

2.5.2. Working mode

The differences between the off-line and the on-line modes areimportant. In the off-line mode, several batch processes andvalidation procedures are applied according to the experimentalparameters and the system’s answers in this context arevisualization plots. In the on-line mode the system cooperateswith other systems to solve a given problem using approachesbased on ensemble learning techniques, including distributed andagent-based approaches [30].

2.6. The experiment navigator

Repeating experiments and interpreting their results are key tomedical diagnosis. Reproducibility should guarantee that when thesame method and the same test material are used, identical resultsare obtained. So, all the data used when the system is tested has tobe adequately stored in a file structure (see Fig. 5).

As shown in Fig. 5, all the experimental outcomes are stored inthe results folder. In it, users can organize the results into severaldestination folders (‘‘Destination folder 1’’ and ‘‘Destination folder2’’ in the case in Fig. 5) according to selected criteria. For example,users might be interested in putting all the experiments carried outwith a particular database or a set of experiments carried out usingonly a subset of the data in the same destination folder.

The experiment navigator works interactively, so that when themouse is positioned on an element of the file structure, the resultsof the experiments are automatically visualized. This is possiblewith the pre-computed images and the information kept after anexperiment is run.

3. eXiT*CBR functionalities

The following functionalities have been developed for users:create/edit configuration file, data conversion, modification csv,dataset generation, CBR application, and navigation. They are madeavailable as buttons in the left area of the main window of theplatform. In addition, all of the data mining applications integratedinto the platform are available through the ‘‘plug-in’’ option. Inaddition, it is possible to change the installation of the tool with theprogram configuration facility.

3.1. Create/edit configuration file

Writing a configuration file by hand can be so unpleasant itwould make anyone give up using the tool. We have developed analternative editor interface that allows users to define aconfiguration file in a user-friendly way, in which all the availabletechniques are displayed in pop-up menus. The directory that

contains the datasets, the methods used for each CBR cycle, and thetype of experiment can be defined using this graphical window.

3.2. Modify csv

Machine learning methods in general and CBR in particularassume that examples are independent. However, medical datacannot always satisfy this constraint. For example, in breast cancerdata, two members of the same family can also be in the samedataset. Note that this situation is different from deduplication orrecord linkage [31], since the individuals are different but related.

This clear dependency can be removed if family relationshipsare identifiable in the dataset. Of course, other kinds of hiddendependencies are more difficult to avoid.

Thus, this facility has links to the current available, domain-dependent methods to deal with case independence.

3.3. Data conversion

The data provided by medical users generally have different fileformats and contain dissimilar information. So, the first step is totransform original datasets into csv files, which is the format ourframework is able to read. A breast cancer relational database hasbeen converted into a csv file as an example. In general, everymedical database has its own structure, format and type of data,and it is therefore very difficult for any method to read all the

[(Fig._6)TD$FIG]

Fig. 6. Datasets generated by stratified cross-validation.


available medical databases. It is often necessary to check newdatabases manually so that they are correctly transformed into acsv file.

3.4. Dataset generation

This function defines datasets for experiments. The input is theoriginal database (in csv form) in which the medical information iscontained (cases). The output is a set of different datasets fortraining and testing the system according to the experimentationprocedure (for example, cross-validation).

There are two kinds of procedures to determine the length ofthe test files: fixed (often used by medical users when, for example,10 test cases are required), and percentage-based (usually in cross-validation).

Once the technique for generating the dataset is selected, thesystem generates as many datasets as necessary. For each dataset,four files are produced:

� P

te

atient5 test file: includes cases of patients to be tested.
� C ontrol test file: includes cases of persons that do not have the
illness to be tested.
� T raining file: includes the cases which form the case base. There
are as many patients as healthy people.
� R emaining file: includes the cases that exceed the proportion in
the training file. That is, if there are many more patients thanhealthy people, the generated files are stratified, and someremaining patient cases would not be considered in this set.

Fig. 6 illustrates the process when a stratified cross-validationmethod is used with a fixed length of test files (n cases per file) and afixed number of datasets (m). Note that these datasets are generatedapart from the CBR application. Therefore, it is possible to defineseveral CBR configurations and test them with the same datasets.

3.5. CBR application

The CBR application facility runs the CBR experiment accordingto a given configuration file. As a result, several internal filescontaining the outputs of the application and other result-visualization supporting information are generated, as explainedin the Section 2.5.1.

3.6. Navigation

The navigation facility is perhaps the most innovative aspect ofour framework. It provides a user-friendly way for the user to

5 These are cases in medical terms, but in order to avoid confusion with the CBR

rminology we prefer to use the term ‘‘patient’’.

navigate through the different experiments run so far. Theexperiments performed are listed in the left panel of the navigationwindow according to the file structure presented previously (seeFig. 5). When the user moves the cursor across the experimentname, the corresponding results are shown in the top right panel.The system also displays the dataset and methods used in theselected experiment.

To facilitate choosing from among all the experiments, thenavigator allows users to retain a particular plot while navigatingacross the experiment tree. With this utility, users can explore andgraphically search for experiments they are interested in. Thesefigures (as many as required) can also be expanded and overlappedin a single graphic, as shown in Fig. 7. Each curve is labeled with thename of the experiment. Several types of lines and colors areavailable.

When a figure is expanded, the confusion matrix is also shown.If the figure is related to a cross-validation experiment, theconfusion matrix obtained for each fold (dataset) is shown in a tabstructure.

It is clear that the experiment navigator greatly facilitates thecomparison of experiment results, which speeds up the tuning ofthe diagnostic system parameters.

3.7. Plug-ins

A step before CBR might be needed to add data miningtechniques (as for example, a feature learning mechanism).eXiT*CBR allows to plug-in any technique in the same interface.Fig. 8 shows the effects on the dataset after applying a plug-incalled family risk calculator, which calculates for each case a newattribute related to family information.

4. Application to breast cancer diagnosis

The first application eXiT*CBR has been used with is a breastcancer diagnostic system. The public database, the Breast CancerWisconsin Dataset [32,33], was used to conduct preliminary testsof the framework functionalities.

The CBR system was then developed for our own breast cancerdataset, which consists of 871 cases, 628 corresponding to healthywomen and 243 to women with breast cancer. There are 1199attributes for each case. They correspond to people’s habits(smoker or not, diet, sport habits, etc.), disease characteristics (typeof tumor, etc.), and gynecological history.

A complex database helps to illustrate the capabilities of theplatform when real, complex data are dealt with. In addition, themedical users gathering the data are also involved in developingthe CBR system and the experiments with eXiT*CBR, so they canalso provide useful feedback on the platform.

[(Fig._7)TD$FIG]

Fig. 7. Expansion of different experiments and comparison.

[(Fig._8)TD$FIG]

Fig. 8. Integration of other techniques, such as plug-in, in eXiT*CBR.


4.1. Main steps for developing a CBR application

The process followed to develop a CBR application is illustratedin Fig. 9. The unfilled boxes show optional steps; the filled onesshould be considered mandatory.

First, the program configuration facility is used to define thedestination directory of all the experiments to be carried out withthis database. Then, the original file is converted into a csv file withthe data conversion procedure. Finally, different datasets are built

[(Fig._9)TD$FIG]

Fig. 9. The main steps followed to develop the breast cancer application.

to perform a cross-validation experiment. Now experimentationcan begin. The first CBR system was set up with the editconfiguration facility. The resulting configuration file, seen inFig. 3, shows that the results will be stored in the ‘‘results’’directory, the name of this first experiment is ‘‘original’’, and thedatasets are taken from the directory ‘‘Datasets10’’. Although thename of the class attribute written in the textual file of Fig. 3 iscryptic (A1_1_1#8), note that it has been selected in pop-up menusset up according to the attribute description provided in the csvfile. Other options, such as the different methods involved in eachphase, are also provided via pop-up menus. In the retrieve phase ofthe CBR system, the experiment has been set up with a Euclideandistance function for numerical values, a Hamming distancefunction for categorical ones, the average (mean) as the globalsimilarity measure, and the selection being the ‘‘majority’’ rule inwhich the five most similar cases are selected (five is the parameterof the method). In the reuse phase, the probabilistic-Pous methodis selected, and the threshold required to use it is set to 0.7 (see [34]for further details on this method). In order to generate results andplots in the experiments, information about the variation of theparameter (i.e. the threshold) must also be provided. Assumingthat the parameter varies in the interval [0,1], variations would bestudied from 0.0 with a step of 0.1. Finally, there is no revise phase,

[(Fig._10)TD$FIG]

Fig. 10. First graphical results obtained with the tool. The AUC value has been enlarged for clarity reasons.


since the experiment deals with a cross-validation experience, andthe test cases are not retained in the system.

The default values for the experiment, namely, the true positiveand false positive rates, have been set and ROC curves have beenused as the visualization plots.

The off-line working mode is selected for post-processing,which means that the output will be sent to the user, instead ofbeing reported on-line to another system.

Once the configuration file is defined, it is possible to call theeXiT*CBR core to instantiate the CBR system defined in it, and theCBR application is used to graphically obtain the correspondingresults, as shown in Fig. 10.

4.2. Using the experiment navigator tool

Next, as the original dataset has dependent data, a process tomake cases independent, which is linked to the modify csv facilityis run to remove the dependent cases and obtain a new dataset.
[(Fig._11)TD$FIG]
Fig. 11. Comparison of two experiments through t

Since the csv file changes, the dataset generation facility must berun again to obtain the files for cross-validation (see functionaldependency interactions in Fig. 9).

In the configuration file, it is only necessary to change theinformation about where the new data are, and the name of theexperiment: ‘‘originalIndp’’. The same edit configuration facilityallows the existing configuration file to be loaded and modified.

Note, however, that the information from the previousexperiments is not ‘‘changed’’, even though it seems like the user‘‘updates’’ the files. As previously explained, this is because theeXiT*CBR core always maintains a backup of the different filesinvolved in an experiment.

After the modifications have been made, the experiment is runagain with the new data, and new results are obtained.

The experiment navigator facilitates the interpretation of theresults. The experimental plots can be enlarged with thecorresponding button of the experiment navigator, and comparedas seen in Fig. 11. The user can see whether the new experiment

he experiment navigator enlargement facility.

6 http://protege.stanford.edu/ [accessed 08.04.10].


runs better, and continue to improve the results from thisconfiguration.

4.3. Improving the case base with a data mining tool

If the medical user now realizes that there is additionalinformation in another file regarding family risk that couldimprove the results if incorporated in the data, then a procedurefor learning the family risk is first defined and added to the existingdata as a new attribute for each case. This procedure can be used asa ‘‘jar’’ file directly in eXiT*CBR through the plug-in facility, whichmeans other researchers can also use it.

After adding this information, the experiment is repeated as in4.2 and the results are improved again.

4.4. Changing the experimental parameters

The medical user can try to improve the results by increasingthe ‘‘k’’ parameter of the retrieval method from 5 to 7. Thisexperimental parameter can be modified in the configuration filethanks to the edit configuration facility. The experiment isrepeated and the new results are then compared with the previousones thanks to the navigation tool.

Note then, that all the parameters of the configuration file can bechanged. They include pre-processing (discretization, normalizationand feature selection), retrieval, reuse, revise and retain methods, aswell as visualization and post-processing methods. However, theuse of some of the eXiT*CBR facilities, beyond the scope of theconfiguration file, could involve the generation of a new dataset, inaddition to repetition of the experiment, as shown in Sections 4.2and 4.3. All of them, when based on a cross-validation method, arecomparable, and the navigation tool can help to determine the bestset of parameters for a given application.

4.5. Feedback from medical users

Medical users filled out a questionnaire to evaluate theframework qualitatively. The following conclusions can be drawnfrom their answers.

Several tools currently support breast cancer diagnosis (a recentsurvey can be found in [35]). However, none of them satisfies thecurrent needs of medical users. The Gail model, for example,includes some epidemiological data and a single piece of dataregarding family information. Other models, such as Claus’s model,uses more in-depth family information, while others focus ongenetic information, like the BRCAPro program does. In general, theavailable tools deals with a single model. Conversely, the approachproposed with a CBR method allows medical users to tackle all thedata at once, in an integrated way, dealing with different patternsimplicit in the data to provide a diagnosis.

Visualization tools allow medical users to play with differentparameters (case attributes or variables) of an illness. For example,information regarding family risk can be included and the resultscompared by means of the ROC curves. Future applications arebeing contemplated in the area of preventive medicine, such asadding a new attribute – surgical operation – and testing thebenefits obtained with it.

5. Related work

General CBR tools have been developed in previous studies. Forexample, cF [6] is a Lisp environment designed for rapidprototyping. CBR* [3] and CBR Shell [4] were both designedfollowing an object-oriented method and implemented in Java, likethis system. Thus, the CBR frameworks have inherited themodularity, reusability and extensibility properties of the ob-

ject-oriented paradigm. The main difference between this workand previous approaches is the aim. Our framework is designed toprovide engineers and medical users with navigation functionali-ties across different experiments, to help them choose theappropriate CBR methods and parameters and assure experimentalreproducibility. In addition, we focus on a particular case of CBR:medical diagnosis. CBR is used for classification and the visualiza-tion techniques are mainly based on ROC graphs, the mostcommonly used technique in this domain.

Other interesting CBR frameworks are jCOLIBRI [7] and MyCBR[36]. The jCOLIBRI framework follows a modular approach similar tothe one used here. Itcharacterizes the differentCBR phases accordingto an ontologyand is interesting beacuseit provides researchers withacommonontologytodefineCBRsystems.Thisontologicalapproachwillbetaken intoaccountin future versions ofeXiT*CBR. The scope ofjCOLIBRI and eXiT*CBR, however, are different: like some previousapproaches, jCOLIBRI supports the development of CBR systems,while eXiT*CBR supports experimentation.

MyCBR focuses on defining similarity measures for the CBRretrieval phase. MyCBR is designed as a plug-in in the PROTEGE6

ontology editor. The case representation needed in MyCBR is basedon a csv file, as in the eXiT*CBR platform. However, we includeinformation on the attribute types and weights, while MyCBRassumes that this information will be edited and provided by anontology to be built in PROTEGE. The ontological approach todescribe attributes could be complementary and may be used toextend eXiT*CBR in the future. However, we tried to load the breastcancer database into MyCBR and had some problems with theamount of attributes involved in the data. Therefore, MyCBR couldbe more useful to test some properties of similarity measures thanas an experimental platform for real CBR systems.

Among other general platforms for data mining, Weka [37] alsooffers an experimentation framework. It allows users to work withdifferent kinds of algorithms at the same time, and to compare theresults. However, it is a general purpose environment, unlike thetool presented in this article which uses a specific methodology(CBR) in a specific domain (medicine) and can visualize the resultsin a useful way for the domain it is designed. Moreover, Weka doesnot support a full CBR cycle, only the nearest neighbor approaches(only retrieval).

6. Conclusions

The development of a CBR system for medical diagnosis isalways time-consuming. Although several tools have beendeveloped to facilitate the rapid construction of prototypes, noneof them have been incorporated navigation through experiments.Experimentation is a cornerstone of the empirical approach usedby engineers and medical users to obtain more in-depthknowledge about the data they are working with.

The tool developed here, eXiT*CBR, facilitates interaction withmedical users, which are not lost in the different experimentsperformed. The tool also allows reproducibility, since it facilitatesexperiment replication as many times as necessary. This isimportant when the research results concern medical applications,and could be very useful for the medical community.

In this article we have shown how eXiT*CBR has been applied todevelop a breast cancer CBR diagnostic system. Our next step is tocollect the most successful methods in the field of medicine to testthe system further, and at the same time enrich the platform. Wealso need to study how eXiT*CBR can be applied to tasks other thandiagnosis. For example, it can be appllied to therapy problems,depending on their complexity, by providing the appropriate reusemethod. If the therapy consists of drug adjustments, value

http://protege.stanford.edu/


variations can be computed according to the methods studied in[38,39]. Other complex adaptation processes, such as the oneproposed in [40] for breast cancer treatment, could be studied withthe hybridization and integration capabilities of eXiT*CBR. Somechallenges could be made against the retrieval methods, sincesome authors distinguish more than one retrieval step based oneither individual patient history (as in [41]) or problem types (as in[38]). On the other hand, diagnosis, as shown in this article, is abinary classification problem that can easily be extended into amulti-class problem by combining the results of binary classifiersas in [42]. We have started to develop an application with suchfeatures using eXiT*CBR (see [43]).

In future work, we should also analyze the extensibility ofeXiT*CBR to handle other case representations that allow thedevelopment of structured and conversational CBR systems.Representations should consider graph structures to deal withcomplex medical concepts, as has been done in [44]. Moreover, theparticularities of conversational CBR require the inclusion ofquestion–answer pairs in addition to a textual description ofproblems [45]. Finally, textual CBR systems should also bedeveloped because of their importance in medicine. The worksof [46] could be a good starting point.

Acknowledgments

This research project has been partially funded by the SpanishMEC projects DPI-2005-08922-CO2-02 and CTQ2008-06865-C02-02, the Girona Biomedical Research Institute (IdiBGi) projectGRCT41 and DURSI AGAUR SGRs grants 2005-00296 and 2009-00523 (AEDS). The Wisconsin breast cancer database was obtainedfrom the University of Wisconsin Hospitals, Madison from Dr.William H. Wolberg.

References

[1] Bergmann R, Wilke W, Althoff K, Breen S, Johnston R. Ingredients for develop-ing a case-based reasoning methodology. In: Bergmann R, Wilke W, editors.Proceedings of the 5th German workshop in case-based reasoning(GWCBR’97). 1997. p. 49–58.

[2] Bergmann R, Althoff KD. Methodology for building CBR applications. In: LenzM, Bartsch-Sporl B, Burkhard HD, Wess S, editors. Case-based reasoningtechnology: from foundations to applications. Berlin: Springer; 1998. p.299–326 [Lecture Notes in Computer Science 1400, Chapter 12].

[3] Jaczynski M. A framework for the management of past experiences with time-extended situations. In: Golshani F, Makki K, editors. Proceedings of the sixthinternational conference on information and knowledge management (CIKM).New York: ACM Press; 1997. p. 32–9.

[4] Aitken S. CBR shell java-v1.0. Available from: http://www.aiai.ed.ac.uk/project/cbr/cbrtools.html [accessed 08.04.10].

[5] Bogaerts S, Leake D. IUCBRF: a framework for rapid and modular case-basedreasoning system development. Technical Report TR617, Computer ScienceDepartment, Indiana University; 2005.

[6] Arcos JL. cF development framework. Available from: http://www.iiia.csic.es/�arcos/cbr.html [accessed 29.04.10].

[7] Diaz-Agudo B, Gonzalez-Calero PA, Recio-Garcıa JA, Sanchez-Ruiz-GranadosAA. Building CBR systems with jCOLIBRI. Science of Computer Programming2007;69:68–75.

[8] Bichindaritz I, Marling C. Case-based reasoning in the health sciences: what’snext? Artificial Intelligence in Medicine 2006;(36):127–35.

[9] Bichindaritz I, Montinali S, Portinali L. Special issue on case-based reasoning inthe health sciences. Applied Intelligence 2008;(28):207–9.

[10] Lopez B, Pous C. Applications in medical dababases. In: Saita L, editor.BluePrint in ubiquitous knowledge discovery KDUbic. European project IST-6FP-021321 Coordination Action document; 2007. p. 36–39. Available from:http://www.kdubiq.org/ [accessed 16.06.09].

[11] Cios KJ, Moore GW. Uniqueness of medical data mining. Artificial Intelligencein Medicine 2002;26(1–2):1–24.

[12] Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters2006;27:861–74.

[13] Swets JA, Dawes RM, Monahan J. Better decisions through science. ScientificAmerican )2000;(October).

[14] Drummond C, Holte RC. Explicitly representing expected cost: an alternativeto ROC representation. In: Simoff SJ, Zaıane OR, editors. Proceedings of thesixth ACM SIGKDD international conference on knowledge discovery and datamining. New York: ACM Press; 2000. p. 198–207.

[15] Dietterich TG. Three challenges for machine learning research. In: Plenary talkat The 9th Ibero-American conference on AI (IBERAMIA); 2004.

[16] Aamodt A, Plaza E. Case-based reasoning: foundational issues, methodologicalvariations, and system approaches. AI Communications 1994;7(1):39–59.

[17] Weiss G. Multiagent systems. A modern approach to distributed artificialintelligence. USA: MIT Press; 1999.

[18] Aggour KS, Pavese M, Bonissone PP, Cheetham WE. SOFT-CBR: a self-optimiz-ing fuzzy tool for case-based reasoning. In: Ashley KD, Bridge DG, editors.Case-based reasoning research and development (ICCBR). Berlin: Springer;2003. p. 5–19 [Lecture Notes in Computer Science 2689].

[19] Francisco Nunez HF. Feature weighting in plain case-based reasoning. PhDthesis, Technical University of Catalonia, Spain; 2004.

[20] Martinez T. Seleccio d’atributs i manteniment de la les base de casos per a ladiagnosi de falles. Master’s thesis, Universitat de Girona, Spain; 2007.

[21] Wilson DR, Martinez TR. Improved heterogeneous distance functions. Journalof Artificial Intelligence Research 1997;6:1–34.

[22] Torra V, Narukawa Y. Modeling decisions: information fusion and aggregationoperators. Berlin: Springer; 2007.

[23] Zhang S, Qin Z, Ling CX, Sheng S. ‘‘Missing is useful’’: missing values in cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineer-ing 2005;17(12):1689–93.

[24] Montani S. Exploring new roles for case-based reasoning in heterogeneous aisystems for medical decision support. Applied Intelligence 2008;28:275–85.

[25] Bilska-Wolak AO, Floyd CE. Development and evaluation of a case-basedreasoning classifier for prediction of breast biopsy outcome with BI-RADSlexicon. Medical Physics 2002;29(September (9)):2090–100.

[26] Hammond KJ. Explaining and repairing plans that fail. Artificial Intelligence1990;45(1–2):173–228.

[27] Aha DW, Kibler D, Albert M. Instance based learning algorithms. MachineLearning 1991;6:37–66.

[28] Wilson D, Martinez T. Reduction techniques for instance-based learningalgorithms. Machine Learning 2000;38(3):257–86.

[29] Schulz S. CBR-works: a state-of-the-art shell for case-based application build-ing. In: Proceedings of the 7th German workshop on case-based reasoning(GWCBR); 1999. p. 3–5.

[30] Plaza E, McGinty L. Distributed case-based reasoning. The Knowledge Engi-neering Review 2005;20(3):261–5.

[31] Nin J, Herranz J, Torra V. On the disclosure risk of multivariate microaggrega-tion. Data & Knowledge Engineering 2008;67:399–412.

[32] Asuncion A, Newman DJ. UCI machine learning repository. Irvine, CA: Universityof California, School of Information and Computer Science; 2007. Available from:http://www.ics.uci.edu/mlearn/MLRepository.html [accessed 08.04.10].

[33] Mangasarian OL, Wolberg WH. Cancer diagnosis via linear programming.SIAM News 1990;23(September (5)):1–18.

[34] Pous C, Gay P, Pla A, Brunet J, Sanz J, Lopez B. Modeling reuse on case-basereasoning with application to breast cancer diagnosis. In: Dochev D, Pistore M,Traverso P, editors. Artificial intelligence: methodology, systems, and applica-tions (AIMSA). Berlin: Springer; 2008. p. 322–32 [LNAI 5253].

[35] Jacobi CE, de Bock GH, Siegerink B, van Asperen CJ. Differences and similaritiesin breast cancer risk assessment models in clinical practice: which model tochoose? Breast Cancer Research and Treatment 2009;115(2):381–90.

[36] Stahl A, Roth-Berghofer T. Rapid prototyping of CBR applications with the opensource tool myCBR. In: Atholff KD, Bergmann R, Minor M, Hanft A, editors.Advances in case-based reasoning technology (ECCBR). Berlin: Springer; 2008.p. 615–30 [LNAI 5239].

[37] Witten IH, Frank E. Data mining: practical machine learning tools andtechniques, 2nd ed., San Francisco, CA, USA: Morgan Kaufmann; 2005.

[38] Marling C, Shubrook J, Schwartz F. Case-based decision support for patientswith type 1 diabetes on insulin pump therapy. In: Atholff KD, Bergmann R,Minor M, Hanft A, editors. Advances in case-based reasoning technology(ECCBR). Berlin: Springer; 2008. p. 325–39 [LNAI 5239].

[39] Pous C. Case-based reasoning as an extension of fault dictionary methods forlinear electronic analog circuits diagnosis. PhD thesis, University of Girona,Spain; 2004.

[40] Lieber J, d’Aquin M, Badra F, Napoli A. Modeling adaptation of breast cancertreatment decision protocols in the KASIMIR project. Applied Artificial Intelli-gence 2008;(28):261–74.

[41] Vorobieva O, Schmidt R. CBR investigation of therapy inefficacy. In: 2nd work-shop on CBR in the health sciences, Madrid, Spain; 2004. Available from: http://www.cbr-biomed.org/jspw/workshops/ECCBR04.jsp [accessed 17.06.09].

[42] Tax DMJ, Duin RPW. Using two-class classifiers for multiclass classification. In:Kasturi R, Laurendeau D, Suen C, editors. Proceedings 16th InternationalConference on Pattern Recognition (ICPR), vol. 2. Los Alamitos, CA: IEEEComputer Society Press; 2002. p. 124–7.

[43] Lopez B, Pous C, Pla A, Gay P. Boosting CBR agents with genetic algorithms. In:McGinty L, Wilson DC, editors. Case-based reasoning research and develop-ment 8th international conference on case-based reasoning ICCBR, Berlin:Springer; 2009. p. 195–209 [LNAI 5650].

[44] Armengol E, Plaza E. Relational case-based reasoning for carcinogenic activityprediction. Artificial Intelligence Review 2003;20(1–2):121–41.

[45] Aha DW, Breslow LA, Maney T. Supporting conversational case-based reason-ing in an integrated reasoning framework. Applied Intelligence 1998;14:9–32.

[46] Lenz M, Hebner A, Kunze M, Lenz M, Bartsch-Sporl B, Burkhard HD, Wess S.Textual CBR. In: Case-based reasoning technology: from foundations to appli-cations. Berlin: Springer; 1998. p. 115–38 [Lecture Notes in Computer Science1400].

http://www.aiai.ed.ac.uk/project/cbr/cbrtools.html

http://www.aiai.ed.ac.uk/project/cbr/cbrtools.html

http://www.iiia.csic.es/~arcos/cbr.html

http://www.iiia.csic.es/~arcos/cbr.html

http://www.kdubiq.org/

http://www.ics.uci.edu/mlearn/MLRepository.html

http://www.cbr-biomed.org/jspw/workshops/ECCBR04.jsp

http://www.cbr-biomed.org/jspw/workshops/ECCBR04.jsp

exit*cbr: a framework for case-based medical diagnosis development and experimentation

Documents