Project Background Report COMP60990

Student Name: Thamer Omer Ba-Dhfari

Supervisor: Prof. Andy Brass

Date: 9 May 2011

Hypothesis Formulation in Ontology Space Applications to large and complex data from primary care


Table of Contents

List of Tables
Abstract
1. Introduction
   1.1. Need of the Study
   1.2. Report Overview
2. Background Research
   2.1. Medical Classification and Coding Systems
      2.1.1. Read Codes
   2.2. Pharmacogenomics and Drug Response
   2.3. Semantic Similarities
      2.3.1. Survey of Semantic Similarity Measures
      2.3.2. Implementation Tools for Semantic Similarity Measures
   2.4. Multidimensional scaling
      2.4.1. Principal Component Analysis
3. Methodology
   3.1. Problem Statement
   3.2. Aims and Objectives
   3.3. System Analysis
      3.3.1. Literature Survey
      3.3.2. Strategy of Literature Search
      3.3.3. Sources of Data and Data Collection
      3.3.4. Project Plan
   3.4. Proposed Work Strategy
      3.4.1. Project Data Set
      3.4.2. Semantic similarity
      3.4.3. Principal Components
References
Appendix A – Project Gantt chart


List of Tables

2.1.1 A hierarchy in the Read Codes


Abstract

Information technology (IT) has contributed to many areas of modern health care. One major breakthrough has been the advent of electronic patient records (EPR), which contain valuable information regarding patients' history of disease, medications and laboratory data, and which make such information straightforward to store and retrieve. EPR have now been adopted in many healthcare systems around the world; the UK was one of the first adopters of this technology, and almost all of its primary care data is now captured and stored in electronic records.

However, these records require further investigation and analysis before they can answer problems relating to pharmaceutical development. One of these problems concerns the ability to predict how individuals respond to certain drugs. This could be done by in-depth study of the records of individual patients, but the high complexity and large volume of these records mean that new approaches are needed. Approaches such as semantic similarity and principal component analysis can be used to overcome these problems.

In this study, we propose a method that could address such problems. The project data set is provided by the project partners and contains a large number of patient records, around 250,000, captured and described in the form of Read codes. The first stage of the proposed method applies semantic similarity measures to map the records into a vector space. The next stage applies principal component analysis to that vector space to map the data into a simple metric space, allowing us to visualise the data and to apply different data mining approaches.


1. Introduction

Patient medical histories were recorded in paper-based medical records until they were transformed into computerised records, known as electronic patient records (EPR). EPR are used to keep track of a patient's history of diseases and medications, which can help in diagnosing diseases based on the history recorded. The medical information stored in EPR is complex and rich; on its website, the Healthcare Information and Management Systems Society (HIMSS) [1] defines EPR as follows:

“The Electronic Health Record (EHR) is a longitudinal electronic record of

patient health information generated by one or more encounters in any

care delivery setting. Included in this information are patient

demographics, progress notes, problems, medications, vital signs, past

medical history, immunizations, laboratory data and radiology reports.”1

Rector et al. [2] state that EPR contain more than factual information about the patient; they also include clinicians' observations at consultations. Several coding systems have been proposed as standards for capturing, encoding and exchanging information between healthcare systems [3]. Read codes are one of these proposals and are used in the UK by almost all primary care practitioners. General practitioners (GPs) use Read codes to record patients' health conditions during consultations, as well as administrative information used by the National Health Service (NHS) [4, 5].

1.1. Need of the Study

It is clear that medical records are valuable resources both for the NHS, for planning future health services, and for researchers working on new medical developments. An adequate understanding of the information in these records has the potential to help in discovering and solving many medical problems. In areas such as drug development, it is essential to know whether a patient has an allergy, or a genetic variant, that could cause an adverse drug reaction [6]. Further investigation and analysis should therefore be conducted on such medical records.

Due to the structure and high dimensionality of Read codes, this data is challenging to interpret and visualise, so certain techniques need to be applied to the medical records first. One of the most promising approaches used to date for interpreting medical resources draws on data mining notions such as semantic similarity measures and principal component analysis. Both techniques are used to map Read codes from their original structure into a simple metric space, allowing further research to be conducted.

1 N.B. electronic patient records (EPR) are also known as electronic health records (EHR)


1.2. Report Overview

The remainder of this report is divided into two chapters and is organised in the following way:

Chapter 2 gives a brief overview of the concepts and topics related to the project domain and describes how they are likely to contribute to solving the study problem. Different computational procedures are also explained and discussed.

Chapter 3 states the study problem alongside its aims and objectives. It also discusses the approach that will be used to achieve the study aims and the implementation methods and tools.

2. Background Research

2.1. Medical Classification and Coding Systems

The main aim of medical records is to keep track of a patient's health history and to store a complete reference of the medications prescribed. These paper-based records were later transformed into electronic records known as electronic medical records (EMR) [7]. EMR make medical records easy both to store and to retrieve. At the same time, the medical terms and concepts used in these records, and in the clinical domain generally, expand continuously, and it has become essential to classify medical terminology so that specific terms can be found and retrieved easily. Medical terms need to be placed into categories or classes, which provide a structured grouping of terms and concepts organised on the basis of some common attribute, quality or property. Such a classification of clinical terms facilitates communication between different healthcare departments. As a result, many applications in medicine and medical information, such as statistical analysis of diseases and clinical decision support systems, have successfully adopted different forms of medical classification systems [3].

Communication between healthcare departments becomes more efficient when a medical classification system is used. However, there is still a chance that a term could be understood differently from its original meaning. Therefore, coding systems have been introduced. A medical coding system labels each medical term with a unique code, which later allows health professionals to recognise and identify the terms without confusion [8]. Medical codes can be numeric or alphanumeric, as we discuss later; these codes include diagnostic, procedural and pharmaceutical terms [3]. There are several medical coding systems, including ICD2, Read codes and SNOMED CT3.

2 ICD: International Classification of Diseases.


The World Health Organisation (WHO) published the ICD system, which encodes diagnostic terminology for general epidemiological and clinical use and for other health administrative communication [9], whereas Read codes and SNOMED CT are intended to record the full detail of patients' medical records [10]. In the following section, we discuss Read codes further.

2.1.1. Read Codes

Read codes are comprehensive and arranged in a hierarchical structure. In the UK, Read codes are widely used by almost all general practitioners (GPs), since they are recommended by the Joint Computing Group of the British Medical Association (CGBMA), the Royal College of General Practitioners (RCGP) and the Primary Health Care Specialist Group [11]. GPs use this medical coding system through their computerised systems, which encode multiple patient details including demographics, lifestyle, symptoms, signs, past history of diseases, family history, diagnoses, therapies, procedures, medications and a variety of administrative items [12]. The system enables GPs to make effective use of computer systems to communicate with other IT systems, such as those in hospitals, and it gives clinicians ready access to patient records for reporting, audit, research and clinical decision support [13].

The Read code system has evolved through three major versions. The first version was developed in the early 1980s by Dr James Read, a general medical practitioner. This version used alphanumeric codes with four characters and included about 57,128 terms and 40,927 concepts [14]. In 1990, a second version was introduced with the same technical properties as the first, except that the code structure was extended to 5 bytes; this version is also known as the 5-byte Read code. This allowed the system to capture more concepts and to cover more healthcare areas, such as secondary care. Furthermore, the second version added case sensitivity to the code characters (A to Z, a to z and 0 to 9). This led to an expansion of the number of codes stored, to a total of 125,914 terms and 88,995 concepts. The third version attempted to address some of the technical issues in the earlier versions, such as the hierarchical relationships between codes. However, in spite of these improvements, GPs still use the second version of Read codes [11].

The hierarchical structure of the Read code system reflects the number of levels of detail. For instance, the 5-byte Read code provides 5 levels of detail, with a code carrying more detail the further it lies from the root. In Table 2.1.1, we see a sample of the 5-byte Read code hierarchy, encoded with 5 alphanumeric characters to represent a specific type of asthma.

3 SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms.


Table 2.1.1 - A hierarchy in the Read Codes [15]

Hierarchy Level   Read Code   Term
1                 H....       Respiratory system disease
2                 H3...       Chronic obstructive pulmonary disease
3                 H33..       Asthma
4                 H331.       Intrinsic asthma
5                 H3311       Intrinsic asthma with status asthmaticus
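To make the hierarchical encoding concrete, the following minimal Python sketch lists the ancestor codes implied by a 5-byte Read code. It assumes only the convention shown in Table 2.1.1, namely that level n uses the first n characters of the code padded to five characters with '.'; it is an illustration, not part of any Read code distribution.

```python
def read_code_ancestors(code):
    """Return the chain of 5-byte Read codes from the top level down to `code`.

    Assumes the 5-byte convention illustrated in Table 2.1.1: level n uses the
    first n characters of the code, padded to five characters with '.'.
    """
    significant = code.rstrip(".")
    return [significant[:level].ljust(5, ".") for level in range(1, len(significant) + 1)]

print(read_code_ancestors("H3311"))
# ['H....', 'H3...', 'H33..', 'H331.', 'H3311']
```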

2.2. Pharmacogenomics and Drug Response

Pharmacogenomics can be defined as a branch of pharmacology. The term "pharmacogenomics" comes from two sciences: pharmacology, which is the study of drug action in living organisms [16], and genomics, which is a discipline in genetics concerned with the study of the entire DNA sequence of organisms and their genetic variations [17]. Pharmacogenomics combines these two sciences and aims to study the different drug responses that might occur in particular individuals or ethnic groups depending on their genetic makeup [18]. Furthermore, the evolution of pharmacogenomics is improving aspects of drug action such as drug response, drug targeting, drug metabolism and drug development [19]. In addition, genotyping techniques are used alongside pharmacogenomics: techniques based on single-nucleotide polymorphisms (SNPs) and copy number variations (CNVs) are being used to understand the common genetic factors responsible for different responses to drugs. However, non-genetic factors, such as environmental factors, might also affect drug response, and this needs further study [20].

Pharmacogenomics has been applied to some serious diseases such as cancer, HIV, diabetes and cardiovascular diseases (CVDs). In CVDs, for example, pharmacogenomics is concerned with the study of drug metabolism, and specific enzymes such as the cytochrome P450 (CYP) family are being investigated. In general, CYP enzymes are responsible for metabolising different classes of drugs, including cardiovascular drugs [21]. A better understanding of variations in these enzymes might provide a clearer picture of the variability in drug responses and might lead to new inventions in drug discovery.

Despite the promise that pharmacogenomics has shown in both pharmacology and genetics, there are still areas in some serious diseases yet to be explored. Questions have also been raised as to whether pharmacogenomics would be a suitable basis for prescribing medication according to specific variants in a patient's genes rather than in the ordinary way. Other issues concern the cost-effectiveness of pharmacogenomics. For example, when prescribing a drug such as warfarin, which could cause a serious adverse reaction in a patient, genetic tests need to be done in order to decide the recommended dosage; these tests might be expensive and time-consuming. Others argue that these tests might reduce the hospitalisation of patients

and, therefore, pharmacogenomics would be cost-effective for both the patient and

the healthcare system [22].

2.3. Semantic Similarities

Semantic similarity measures how similar concepts or terms are to each other in terms of their meaning or content. It is commonly used for ontology learning and information retrieval [23, 24]. Currently, one of the most important applications of semantic similarity is the Gene Ontology (GO)4: GO semantic similarity is used to compare genes and proteins based on the similarity of their functions [25-28]. Another application of semantic similarity is WordNet5, a lexical database for the English language [29]. WordNet organises words into sets of synonyms called synsets and gives the semantic relations between those sets [30, 31].

Various measures have been proposed in the literature to determine semantic similarity, either between two concepts or between two sets of concepts. Measures used to compare two concepts are classified into two main categories, edge-based measures and node-based measures, with hybrid measures combining the two. In the following section we present these different measures along with a brief discussion of their properties.

2.3.1. Survey of Semantic Similarity Measures

Measures between two concepts

o Edge based measures

In an edge-based (or distance-based) approach, the links between terms and their types are the data source. To calculate similarity, this approach relies on the depth of the terms in the hierarchy; in other words, measures following this approach select the shortest of all the possible paths between the terms. Since these measures depend on paths and on the structural part of the taxonomy, they are difficult to apply in some fields, such as lexical databases.

Researchers have proposed several such measures, including Wu and Palmer's similarity measure, Leacock and Chodorow's similarity measure, Pozo's measure and the IntelliGO measure.

4 http://www.geneontology.org/

5 http://wordnet.princeton.edu/


Wu & Palmer similarity measure (WP Measure)

This measure is based on the number of 'is-a' relations between the concepts and their least common subsumer (LCS) [32]. The depth of the LCS is related to the combined depths of both terms, and the expected results of this similarity measure range between 0 and 1. It is measured as follows:

simWP(c1, c2) = 2 · depth(LCS) / (depth(c1) + depth(c2))

Leacock and Chodorow similarity measures (LC measure)

This measure is based on the length of the shortest path between the concepts and the maximum depth of the taxonomy [33]. It is measured as follows:

simLC(c1, c2) = -log( length(c1, c2) / (2 · D) )

where length(c1, c2) is the length of the shortest path between the two concepts and D is the maximum depth of the taxonomy.
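A minimal sketch of these two edge-based measures, assuming a toy tree-shaped taxonomy supplied as a child-to-parent map (in a tree, the shortest path between two terms passes through their least common subsumer); the codes and the maximum depth are invented for illustration:

```python
import math

# Toy taxonomy: child -> parent (root has no entry); purely illustrative.
PARENT = {"H3...": "H....", "H33..": "H3...", "H331.": "H33..", "H3311": "H331."}

def ancestors(code):
    """Chain from the code up to the root, inclusive."""
    chain = [code]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(code):
    return len(ancestors(code))  # root has depth 1

def lcs(c1, c2):
    """Least common subsumer: the deepest shared ancestor in the tree."""
    shared = set(ancestors(c1)) & set(ancestors(c2))
    return max(shared, key=depth)

def sim_wp(c1, c2):
    """Wu & Palmer: 2*depth(LCS) / (depth(c1) + depth(c2)), in [0, 1]."""
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

def sim_lc(c1, c2, max_depth=5):
    """Leacock & Chodorow: -log(path length / (2 * max depth))."""
    path_len = depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2))
    return -math.log(max(path_len, 1) / (2 * max_depth))

print(sim_wp("H3311", "H33.."), sim_lc("H3311", "H33.."))
```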

o Node based measures

This approach ignores the position of the node and regards the contents of the node, along with its properties, as the data source. The similarity of two terms is computed as a combination of the common and distinctive features of both entities. This approach was used by Resnik, Lin, and Jiang and Conrath in their measures. These measures are based on information content (IC), and the calculation of similarity depends on the frequencies of the two terms involved and on that of their most informative common ancestor (MICA). In contrast, measures such as GraSM (Graph-based Similarity Measure) follow a different route: GraSM considers the average IC of all disjoint common ancestors. The next section briefly explains these methods.

Resnik’s Measure

Resnik's measure [34] relies on information content (IC). It calculates the similarity between two terms as the IC of their most informative common ancestor (MICA). The equation of this metric is as follows:

simRes(c1, c2) = IC(cMICA)

with IC(c) = -log P(c),


where P(c) is the probability of encountering term c. This probability is estimated from term frequencies as follows:

P(c) = freq(c) / N

Here, each term is counted together with the terms annotated to it or to one of its descendant terms, giving freq(c); this count is divided by N, the total number of terms in the data set being compared.

The minimum value of this measure is 0 and there is no maximum value. However, any two pairs of terms that share the same MICA receive the same similarity, which can be considered a deficiency of this approach [35].
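A minimal sketch of the information-content calculation and of Resnik's measure, assuming hypothetical annotation counts and a child-to-parent map; the counts and codes are invented for illustration:

```python
import math

# Hypothetical annotation counts and taxonomy (child -> parent); illustrative only.
COUNTS = {"H....": 10, "H3...": 6, "H33..": 5, "H331.": 2, "H3311": 1}
PARENT = {"H3...": "H....", "H33..": "H3...", "H331.": "H33..", "H3311": "H331."}

def ancestors(code):
    chain = [code]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def propagated_count(code):
    """freq(c): the count of c plus the counts of all of its descendants."""
    return sum(n for term, n in COUNTS.items() if code in ancestors(term))

N = propagated_count("H....")   # total usage in this toy data set

def ic(code):
    """IC(c) = -log P(c), with P(c) = freq(c) / N."""
    return -math.log(propagated_count(code) / N)

def sim_resnik(c1, c2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(ic(a) for a in common)

print(ic("H33.."), sim_resnik("H3311", "H331."))
```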

Jiang and Conrath Measure (JC Measure)

Having noted the deficiency in the previous measure, Jiang and Conrath [36] suggested a semantic measure that uses not only the IC of the MICA but also the IC of every concept. The result expresses a semantic distance rather than a semantic similarity, using the following equation:

distJC(c1, c2) = IC(c1) + IC(c2) - 2 · IC(cMICA)

As semantic similarity and semantic distance are inversely related, a bigger distance between the concepts means that there is less similarity between them. A related similarity measure can be derived from this distance, for example:

simJC(c1, c2) = 1 / (1 + distJC(c1, c2))

Lin’s semantic measures

This measure uses the same elements as the previous measure. Lin [37] relates the information shared by the terms, the IC of their MICA, to the IC of each term. Lin's measure is as follows:

simLin(c1, c2) = 2 · IC(cMICA) / (IC(c1) + IC(c2))


The results of this measure range between 0 and 1 and indicate the ratio of the information the concepts share in common to the IC of the concepts being compared.
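A minimal sketch of both IC-based formulas, taking the three information-content values (for instance, those produced by the sketch above) as inputs; the example numbers are invented:

```python
def dist_jc(ic_c1, ic_c2, ic_mica):
    """Jiang & Conrath distance: IC(c1) + IC(c2) - 2*IC(MICA)."""
    return ic_c1 + ic_c2 - 2 * ic_mica

def sim_jc(ic_c1, ic_c2, ic_mica):
    """One common similarity derived from the JC distance."""
    return 1.0 / (1.0 + dist_jc(ic_c1, ic_c2, ic_mica))

def sim_lin(ic_c1, ic_c2, ic_mica):
    """Lin similarity: 2*IC(MICA) / (IC(c1) + IC(c2)), in [0, 1]."""
    return 2 * ic_mica / (ic_c1 + ic_c2)

# Invented example values: IC(c1) = 2.0, IC(c2) = 1.5, IC(MICA) = 1.2
print(sim_jc(2.0, 1.5, 1.2), sim_lin(2.0, 1.5, 1.2))
```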

GraSM (Graph-based Similarity Measure)

Couto et al. [27] proposed a new approach to finding semantic similarities, known as GraSM. This approach avoids using only the concepts' most informative common ancestor and instead considers common ancestors that are disjunctive, in the sense of being reachable through independent paths. GraSM produces lower semantic similarity values since it considers the average IC of all disjunctive common ancestors.

GraSM considers a1 and a2 to be disjunctive ancestors of c if there is a path from a1 to c not containing a2 and a path from a2 to c not containing a1. This can be represented by:

DisjAnc(c) = { (a1, a2) : there is a path from a1 to c that does not contain a2, and a path from a2 to c that does not contain a1 }

Given two concepts c1 and c2, their common disjunctive ancestors are the most informative of the common ancestors that are mutually disjunctive; that is, a1 is a common disjunctive ancestor of c1 and c2 if, for each common ancestor a2 that is more informative than a1, the pair (a1, a2) is a disjunctive ancestor pair of c1 or of c2. The equation is as follows:

CommonDisjAnc(c1, c2) = { a1 ∈ CommonAnc(c1, c2) : for all a2 with IC(a2) > IC(a1), (a1, a2) ∈ DisjAnc(c1) or (a1, a2) ∈ DisjAnc(c2) }

GraSM then defines the shared information of both concepts, c1 and c2, as the average IC of their common disjunctive ancestors:

shareGraSM(c1, c2) = ( Σ a ∈ CommonDisjAnc(c1, c2) IC(a) ) / |CommonDisjAnc(c1, c2)|


Pesquita et al. [28] showed that using GraSM in combination with measures including Jiang and Conrath's, Lin's and Resnik's increased the performance of semantic similarity on some data sets, while on other data sets the results were inconclusive.

o Hybrid Measures

Some measures derive their functionality from both edge-based and node-based approaches, combining the advantages of both. However, in some cases their accuracy might not be close to similarity as perceived by people [35].

Othman’s Measures

Othman et al. [38] proposed a semantic similarity metric that combines a distance measure with node content. In this measure, edges are weighted by the depth of the node and by the difference in IC between the nodes linked by that edge.

Wang’s Measure

This measure computes semantic similarity by comparing the locations of two terms and their semantic relations with their ancestor terms [39]. The semantic similarity is calculated by adding up the semantic contributions of all common ancestors of the two terms and dividing by the total semantic contribution of each term's ancestors to that term. This measure has been applied successfully to the Gene Ontology (GO). In GO, terms are represented as directed acyclic graphs (DAGs); for example, term A is represented as DAGA = (A, TA, EA), where TA is the set of terms in DAGA and EA is the set of edges connecting the GO terms in DAGA. For any term t in DAGA = (A, TA, EA), its semantic value with respect to term A, SA(t), is defined as follows:

SA(A) = 1
SA(t) = max { we · SA(t') : t' is a child of t in DAGA },  for t ≠ A

where term A contributes to itself with the value 1 (SA(A) = 1) and we is the semantic contribution factor for the edge e ∈ EA linking term t with its child term t'. Once SA(t) has been obtained for all terms in DAGA, the semantic value of term A, SV(A), is calculated as follows:

SV(A) = Σ t ∈ TA SA(t)

Given DAGA = (A, TA, EA) and DAGB = (B, TB, EB) for GO terms A and B respectively, the semantic similarity between these two terms, S(A, B), is calculated as:

S(A, B) = ( Σ t ∈ TA ∩ TB (SA(t) + SB(t)) ) / ( SV(A) + SV(B) )

where SA(t) is the semantic value of GO term t with respect to term A, and SB(t) is the semantic value of GO term t with respect to term B.
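A minimal sketch of Wang's measure, assuming a toy child-to-parents map and a single semantic contribution factor w shared by every edge; with one uniform factor, the maximum over paths can be propagated upwards as shown below. The graph and factor are invented for illustration:

```python
# Toy DAG: term -> list of parents; purely illustrative.
PARENTS = {"E": ["C", "D"], "C": ["A"], "D": ["A", "B"], "A": [], "B": []}
W = 0.8  # assumed semantic contribution factor for every edge

def s_values(term):
    """S_term(t) for t = term and all of its ancestors."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:                        # walk from the term towards the roots
        nxt = []
        for t in frontier:
            for p in PARENTS[t]:
                cand = W * s[t]            # contribution propagated along one edge
                if cand > s.get(p, 0.0):   # keep the maximum over all paths
                    s[p] = cand
                    nxt.append(p)
        frontier = nxt
    return s

def sim_wang(a, b):
    sa, sb = s_values(a), s_values(b)
    common = set(sa) & set(sb)
    return sum(sa[t] + sb[t] for t in common) / (sum(sa.values()) + sum(sb.values()))

print(sim_wang("E", "D"))
```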

Measures between two sets of concepts

o Pair-wise

In a pair-wise approach, the terms of both objects are paired together and semantic similarity is then calculated for each pair. Common measures using this approach include the arithmetic average approach, the maximum approach and the best-match average.

Arithmetic Average (AVG) approach

In the AVG approach [25, 26], the semantic similarity is computed as the average over all pairings of terms from the two sets. It uses the following formula:

simAVG(A, B) = ( Σ a ∈ A Σ b ∈ B sim(a, b) ) / ( |A| · |B| )


Maximum (MAX) approach

In this approach [25], the similarity between the two sets (A, B) is the maximum similarity over all pairs of terms drawn from them, calculated as follows:

simMAX(A, B) = max { sim(a, b) : a ∈ A, b ∈ B }

Best-Match Average (BMA) approach

In BMA [40], similarity is calculated by comparing every term in the first set (A) with its most similar term in the second set (B), and vice versa; the result is the average of these best matches between the two sets. BMA is calculated using the following formula:

simBMA(A, B) = 1/2 · ( (1/|A|) Σ a ∈ A max b ∈ B sim(a, b) + (1/|B|) Σ b ∈ B max a ∈ A sim(a, b) )

Pesquita et al. [28, 41] suggested that the BMA approach could be the best among the pair-wise approaches because it provides a good balance between the MAX and AVG approaches by considering all terms and not only the most significant matches.
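A minimal sketch of the three set-level combinations, parameterised by any term-level similarity function sim(a, b), such as the hypothetical sim_resnik() above:

```python
def sim_avg(A, B, sim):
    """Arithmetic average over all term pairs of the two sets."""
    return sum(sim(a, b) for a in A for b in B) / (len(A) * len(B))

def sim_max(A, B, sim):
    """Maximum similarity over all term pairs."""
    return max(sim(a, b) for a in A for b in B)

def sim_bma(A, B, sim):
    """Best-match average: average each term's best match, in both directions."""
    a_to_b = sum(max(sim(a, b) for b in B) for a in A) / len(A)
    b_to_a = sum(max(sim(a, b) for a in A) for b in B) / len(B)
    return (a_to_b + b_to_a) / 2

# Example usage with the earlier Read-code sketch:
# sim_bma({"H3311", "H33.."}, {"H331."}, sim_resnik)
```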

In summary, a considerable number of semantic similarity measures have been developed. To help select a particular measure, Pesquita et al. [41] discuss this problem and identify three considerations. Firstly, the choice of measure depends on whether the comparison is between two concepts or between two sets of concepts. Secondly, measures differ in the level of detail they capture; for example, graph-based measures such as GraSM give a more generalised similarity, whereas measures such as the best-match average find more detailed similarity between concepts. Thirdly, by detailed analysis of the terms and sets (or bags) of concepts in the given data set, the researcher can decide on the proper measure based on the results of the computations; this can be done using tools that implement several semantic similarity measures at once.


2.3.2. Implementation Tools for Semantic Similarity Measures

Many tools have been developed to calculate semantic similarity measures. Such tools fall mainly into three categories: web tools, standalone tools and R packages. Firstly, web tools are used to compare and calculate semantic similarities in a simple way, without requiring any maintenance or updating, though they offer only certain options. These tools include FuSSiMeG6, ProteInOn7, G-SESAME8 and FunSimMat9.

Secondly, standalone tools are more stable and more capable than web tools for complex computations. In comparison to web tools, standalone applications require local installation and regular updates and maintenance. Standalone applications include the DynGO and UTMGO tools.

R packages are the third type of tool used to find semantic similarities. These packages can be combined with other packages, such as visualisation or statistical analysis tools. Examples include packages provided by the Bioconductor project such as SemSim10, GOvis11 and csbl.go12.

2.4. Multidimensional scaling

Multidimensional scaling (often abbreviated to MDS) encompasses statistical methods for reducing the dimensionality of a given data set. This is done by mapping the distances between data points in a high-dimensional space into a lower-dimensional space: MDS searches for a configuration of points in the low-dimensional space, each representing an object, such that the distances between these points reproduce the corresponding dissimilarities in the high-dimensional space [42]. Possible applications of MDS include visualising data to identify similarities or dissimilarities and preparing data for different data mining methods. Different models, including metric multidimensional scaling and non-metric multidimensional scaling, have been used to search for the space and the associated configuration of points [43].
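As an illustration of how such a configuration can be computed, a minimal sketch using scikit-learn's MDS on a precomputed dissimilarity matrix; the matrix values are invented for illustration:

```python
import numpy as np
from sklearn.manifold import MDS

# Invented pairwise dissimilarities between four objects (symmetric, zero diagonal).
D = np.array([[0.0, 0.2, 0.7, 0.9],
              [0.2, 0.0, 0.6, 0.8],
              [0.7, 0.6, 0.0, 0.3],
              [0.9, 0.8, 0.3, 0.0]])

# Embed the objects as points in 2-D so that their distances approximate D.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords)  # one 2-D point per object, ready for plotting
```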

2.4.1. Principal Component Analysis

Alternatively, because their purpose is similar to that of MDS, related methods such as principal component analysis (PCA) have been used. PCA is an effective statistical tool for dimension reduction of a given high-dimensional dataset. It is used for data visualisation, compression, feature selection and feature extraction [43].

6 http://xldb.fc.ul.pt/rebil/ssm

7 http://xldb.fc.ul.pt/biotools/proteinon

8 http://bioinformatics.clemson.edu/G-SESAME

9 http://funsimmat.bioinf.mpi-inf.mpg.de

10 http://www.bioconductor.org/packages/2.2/bioc/html/SemSim.html

11 http://bioconductor.org/packages/2.3/bioc/html/GOstats.html

12 http://csbi.ltdk.helsinki.fi/csbl.go/


Furthermore, PCA is considered a powerful technique for analysing data because of its simplicity as a non-parametric method for retrieving information from datasets. By applying PCA to a particular dataset, we can more easily identify the underlying factors behind the observed data, and patterns of similarity become clearer through visualisation [44].

In order to extract the important information from a given dataset, PCA calculates a set of new variables called principal components (PCs), which are obtained as linear combinations of the dataset's variables and observations [45]. Multiple steps are followed to calculate the principal components [46]. Firstly, we take a dataset of n data points measured in d dimensions; in the two-dimensional case the dimensions are X and Y. The next step is to subtract the mean of both X and Y, so that every X value has the mean of the X values subtracted from it and every Y value has the mean of the Y values subtracted from it:

X'i = Xi - mean(X),   Y'i = Yi - mean(Y)

Next, the covariance matrix for the dataset with d dimensions is calculated. The covariance of two dimensions X and Y is given by the following formula:

cov(X, Y) = ( Σ i=1..n (Xi - mean(X)) · (Yi - mean(Y)) ) / (n - 1)

and the covariance matrix C is the d × d matrix whose entry in row j and column k is the covariance between the j-th and k-th dimensions:

C = (cj,k),   cj,k = cov(Dim j, Dim k)

After calculating the covariance matrix, the next step is to find the eigenvectors and eigenvalues of this matrix. Once the eigenvectors are found, they are placed in order by eigenvalue, highest to lowest; the eigenvector with the highest eigenvalue is the first PC of the dataset. The second PC is computed in such a way that it is orthogonal to the first PC and has the next highest eigenvalue; the other principal components are computed likewise [45]. There are at most d principal components; to be more precise in the results it is possible to retain all of them, whereas keeping only the leading components simplifies the data but could lead to ignoring important information.
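A minimal numpy sketch of these steps on invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 observations, 5 dimensions (invented data)

X_centred = X - X.mean(axis=0)            # step 1: subtract the mean of each dimension
C = np.cov(X_centred, rowvar=False)       # step 2: d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # step 3: eigenvalues and eigenvectors (C is symmetric)

order = np.argsort(eigvals)[::-1]         # step 4: order components by decreasing eigenvalue
components = eigvecs[:, order]
scores = X_centred @ components[:, :2]    # project onto the first two principal components

print(scores.shape)                       # (100, 2): data mapped to a 2-D space
```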

PCA has a number of limitations, such as issues related to the choice of reduced dimensions and the fact that the principal components are not necessarily statistically independent. Therefore, several extensions, such as independent component analysis (ICA), principal coordinate analysis and kernel PCA, have been proposed to overcome such limitations. ICA has been tried and tested in applications where standard PCA has shown insufficient results, such as signal and image processing [44].


PCA can be implemented through different numerical computation tools such as

ViSta13, Scilab14, GNU Octave15 and Weka16, which are free software, and the

commercial statistical software MATLAB17.

3. Methodology

3.1. Problem Statement

Since the introduction of the concept of personalised medicine, many questions have been posed about whether the traditional way of prescribing medications remains appropriate, given that some treatments can cause severe drug reactions in particular groups of people [21]. One way to avoid this is to deliver the right dose of a drug tailored to an individual's genetic makeup [47]. This problem is one of the challenges facing modern pharmaceutical development. At the same time, data held in health care systems can have an important impact on solving such a problem; however, the complexity and large volume of medical data can be an obstacle to interpreting it. Therefore, adopting data mining approaches such as semantic similarity and principal component analysis, which have shown promising results [26, 48, 49], is worth considering in order to arrive at an effective solution for predicting possible drug responses. These two approaches will be applied to the medical data by mapping it from an ontology space to a simple metric space, providing ways to visualise it effectively and to apply data mining techniques.

3.2. Aims and Objectives

The main aim of the present study is to provide an insight into the invaluable resources in the data recorded by primary care health professionals, in order to develop better ways of predicting how patients respond to particular treatments. This will lead to maximising the effectiveness of medication by tailoring dosages to a particular patient's specific needs. Subsidiary aims are to evaluate and validate whether computer science approaches such as semantic similarity measures and principal component analysis are promising for interpreting the large and complex data that emerges from the work of GPs, allowing both visualisation and data mining to be applied.

13 http://forrest.psych.unc.edu/research/index.html

14 http://www.scilab.org/

15 http://www.gnu.org/software/octave/

16 http://www.cs.waikato.ac.nz/~ml/weka/

17 http://www.mathworks.com/products/matlab/


In order to achieve the main aim of the study, we will attempt to meet the following objectives:

- Developing a better understanding of health care systems and how medical data is captured, exchanged and encoded.
- Understanding the concept of pharmacogenomics and its related applications to personalized medicine.
- Analysing the given data set with regard to its large size and complexity and its medical coding schema.
- Drawing up a work strategy for mapping the project data set.
- Implementing and testing the proposed strategy.
- Evaluating the performance of the proposed system.
- Improving the work strategy based on results and feedback.

Additionally, the subsidiary aims are addressed through the following objectives:

- Exploring strategies such as SSMs and PCA in order to apply them to the given data set for hypothesis formulation and data mining.
- Undertaking a literature review of the applications of SSMs in the Gene Ontology (GO) and WordNet.
- Reviewing the existing semantic similarity metrics to determine those most appropriate for implementation on the project data set.
- Collecting the existing implementation tools for both SSMs and PCA and determining the most appropriate tools.

3.3. System Analysis

3.3.1. Literature Survey

In healthcare systems, data such as medical records are generated in large volumes and stored in complex forms. As a result, researchers face difficulties in analysing and investigating such resources. As mentioned earlier, computer science techniques have been used in several applications that deal with huge amounts of data. One of these techniques, semantic similarity measures (SSMs), is used to map a set of documents or terms into a metric space based on the likeness of their meaning or semantic content; SSMs have been successfully applied in biomedical ontologies. Another technique, principal component analysis (PCA), is used in this project mainly as a dimension reduction tool.

A considerable amount of literature has been published on SSMs and PCA. The first stage of this study involved two literature reviews. Firstly, a systematic review of the current literature was conducted to identify the existing semantic similarity metrics and the ways they are applied. This was followed by the collection and analysis of studies in order to select a suitable measure to be


implemented on the given data set. This review also included research into the uses

of PCA on high-dimensional data, and a consideration of its implementation tools in

order to determine the software that should be used. Secondly, a survey of literature

relating to medical terminologies, classifications and coding systems was undertaken

in order to fully understand the nature of the medical records in primary care and

how they are organised and structured.

3.3.2. Strategy of Literature Search

The criteria for accepting literature on SSMs focused specifically on publications about semantic similarity metrics applied to fields such as the Gene Ontology (GO)18 or linguistics (WordNet)19. Acceptance was restricted to publications from 2000 onwards; other publications were excluded. However, some articles and books relating to PCA published before 2000 were accepted, since this mathematical procedure was developed earlier. Other criteria were adopted for searching for, and retrieving, appropriate medical studies relating to different medical coding systems, focusing mainly on the Read code system and excluding studies of other coding systems from the search.

3.3.3. Sources of Data and Data Collection

The sources of data for this study were both primary and secondary. This work attempted to identify existing knowledge related to SSMs, PCA and Read codes. Data were collected through interviews with the project partners and a survey of the available literature.

Primary Data

The collection of the primary data will take place in the second stage of this

study20. The project data set will be obtained from the project partners.

Multiple methods of data collection such as interviews, personal discussions

and observations will be used.

Secondary Data

The process of collecting secondary data took place in the first stage of the

study. Different online databases and libraries were used to review the

literature. These included the John Rylands University Library catalogue21, the

U.S. National Library of Medicine (PubMed)22, the ISI Web of Knowledge23

18 http://www.geneontology.org/

19 http://wordnet.princeton.edu/

20 See project plan section.

21 http://catalogue.library.manchester.ac.uk/

22 http://www.ncbi.nlm.nih.gov/pubmed/

23 http://wok.mimas.ac.uk/


and the ACM Digital Library24. These databases were searched using the following search terms: "semantic similarities (measures/metrics/scores)", "semantic similarities in gene ontology", "principal component analysis", "medical (terminology/classification/coding) systems" and "Read codes". In addition, hand-searching of related journals and reports was carried out using Google Scholar with terms corresponding to those listed above.

3.3.4. Project Plan

The project has two main stages (February 2011 to September 2011). The first stage was the period from February to the end of May. In this period we worked on three assignments for the research methods and professional skills course; these assignments included project statements, project plans and a project website. This stage was intended to establish an effective and solid background for the research, and working on these pieces of coursework also helped in planning and in collecting relevant literature on the study topic. On 9 May we are required to submit a background report containing a review of the literature relating to the project topic, along with a description of the purpose, objectives and deliverables of the project.

The second stage starts in June and ends in September with the submission of the dissertation. After the review of existing techniques in the first stage, the second stage will be devoted to the implementation of the proposed work strategy. In this stage, there will be visits to AstraZeneca Research to receive the project data set, which will help in implementing and testing the proposed methods and in accomplishing the study objectives. Writing the final study report will probably start during the period from July to September and will include regular meetings with the academic supervisor to demonstrate the progress of the work. Full details of the different project tasks can be found in Appendix A, where the project plan is represented as a Gantt chart.

3.4. Proposed Work Strategy

To achieve the main study aims, the following steps are planned:

Step 1: Apply SSMs to the data set to map it into a semantic space, as follows: apply the measure to the individual Read codes of a patient, and then compute the semantic similarity between two sets of Read codes. By repeating this process for each patient's record, a vector of similarities for each patient will be generated. A further semantic similarity calculation will then take place between pairs of patient records. Together, these two calculations will map the data into a vector space.

24 http://portal.acm.org


Step 2: Apply PCA to the data set to reduce its dimensionality, as follows: after mapping patients' records into vectors of similarities, we can readily find the principal components by performing PCA. As a result, the data will be mapped into a low-dimensional vector space, in which we can perform different machine learning techniques and visualise the data.

Further details will be described in the following sections.
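A minimal sketch of this two-step pipeline, assuming a hypothetical code-level similarity function sim_code (for example, the Resnik-based measure described in section 3.4.2) and a list of patients' Read-code sets as input; it reuses the AVG set-level combination and the PCA steps shown earlier:

```python
import numpy as np

def patient_similarity(codes_a, codes_b, sim_code):
    """Step 1: AVG set-level similarity between two patients' Read-code sets."""
    return sum(sim_code(a, b) for a in codes_a for b in codes_b) / (len(codes_a) * len(codes_b))

def pipeline(patients, sim_code, n_components=2):
    """patients: list of Read-code sets, one per patient (hypothetical input format)."""
    n = len(patients)
    S = np.zeros((n, n))                    # patient-by-patient similarity matrix
    for i in range(n):
        for j in range(n):
            S[i, j] = patient_similarity(patients[i], patients[j], sim_code)

    # Step 2: PCA on the similarity vectors to obtain a low-dimensional metric space.
    S_centred = S - S.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(S_centred, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    return S_centred @ eigvecs[:, order]    # n_patients x n_components coordinates
```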

3.4.1. Project Data Set

The data set obtained from the project partners contains GP records from Salford, a city in the North West of England. It holds anonymised patient records for nearly a quarter of a million individuals; the data set is large and complex and requires considerable processing before the proposed methods can be applied to it. The records are captured and described using a clinical coding scheme, the Read codes. The structure of Read-coded records will help us to implement techniques such as SSMs and PCA, since the records are encoded in such a way that each patient is associated with a set of Read codes over a given period of time.
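As a working assumption about the record layout (the actual schema will depend on the data provided by the project partners), each patient identifier could map to a set of dated Read codes, from which the code sets used by the similarity step are derived:

```python
# Hypothetical, anonymised record layout; field names and values are assumptions, not the real schema.
patient_records = {
    "patient_001": [("2009-03-14", "H33.."), ("2010-11-02", "H3311")],
    "patient_002": [("2008-07-21", "H3...")],
}

# The Read-code sets used by the similarity step are then simply:
code_sets = {pid: {code for _, code in events} for pid, events in patient_records.items()}
```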

3.4.2. Semantic similarity

Applying semantic similarity measures to the project data set is done in two steps:

Compute similarity between two Read codes of one patient:
Here, we compare two Read codes of an individual patient to find the semantic similarity, using the GraSM approach [27] together with Resnik's measure [34]25. Resnik defines the measure as the information content (IC) of the most informative common ancestor (MICA):

simRes(c1, c2) = IC(cMICA)

where the MICA of the two concepts c1 and c2 is the common ancestor with the greatest information content:

cMICA = argmax { IC(a) : a ∈ CommonAnc(c1, c2) }

The IC of a code can be defined as follows:

IC(c) = -log P(c),   with P(c) = freq(c) / N

where freq(c) counts code c together with its descendant codes in the data set and N is the total number of codes. This process will be applied to each patient's record.

25 See the "Survey of Semantic Similarity Measures" section for more information about both measures.

Compute similarity between two patient records:
In this step we will use the arithmetic average (AVG) approach to find the similarity between two patient records P1 and P2, each given as a set of Read codes. The AVG measure is calculated as follows:

sim(P1, P2) = ( Σ c1 ∈ P1 Σ c2 ∈ P2 simRes(c1, c2) ) / ( |P1| · |P2| )

where sim(P1, P2) is the similarity between the two patient records P1 and P2; that is, we calculate the average of the code-level similarities obtained using the measure mentioned in step 1.

By this stage, we can represent the similarities of the patient records obtained above in a vector space, to which principal component analysis can then be applied.

3.4.3. Principal Components

In order to achieve the main aims of this study, further calculations must be performed on the vector space generated in the previous step. In other words, the patient records mapped into similarity space are still represented in a high-dimensional space, in which we can neither perform machine learning techniques nor visualise the data. As discussed earlier in the background research section of this report, PCA is used in such cases since it offers dimensionality reduction and gives a better understanding of the variance and structure of the data [50].

Finally, the data set will be ready at this stage. We will then be able to apply different data mining methods and to project the data into a simple metric space for visualisation.


References

1. Healthcare Information and Management Systems Society. Electronic Health Record. Available from: http://www.himss.org/ASP/topics_ehr.asp [Accessed April 2011]

2. Rector, A., W. Nowlan, and S. Kay. Foundations for an electronic medical record. Methods Inf Med, 1991. 30(3): p. 179-186.

3. Benson, T., Principles of health interoperability HL7 and SNOMED. 2010, London: Springer Verlag.

4. Gillam, S. and A.N. Siriwardena, The Quality and Outcomes Framework: Qof-Transforming General Practice. 2010: Radcliffe Publishing.

5. NHS Confederation and British Medical Association. Investing in general practice: the new GMS contract. British Medical Association, 2003.

6. Nebeker, J.R., P. Barach, and M.H. Samore. Clarifying Adverse Drug Events: A Clinician's Guide to Terminology, Documentation, and Reporting. Annals of Internal Medicine, 2004. 140(10): p. 795-801.

7. Shortliffe, E.H. The evolution of electronic medical records. Academic Medicine, 1999. 74(4): p. 414-419.

8. Coiera, E., Guide to Medical Informatics, the Internet and Telemedicine. 1997: Chapman and Hall, Ltd. 376.

9. World Health Organization (WHO). International Classification of Diseases (ICD). 2011. Available from: http://www.who.int/classifications/icd/en/ [Accessed 30 April 2011]

10. Cimino, J. Review paper: coding systems in health care. Methods Inf Med, 1996. 35(4-5): p. 273-84.

11. de Lusignan, S. Codes, classifications, terminologies and nomenclatures: definition, development and application in practice. Informatics in Primary Care, 2005. 13(1): p. 65-70.

12. House of Commons: Health Committee, The electronic patient record: Sixth Report of Session 2006-07. 2007.

13. Booth, N. What are the Read Codes? Health Libraries Review, 1994. 11(3): p. 177-182.

14. Bentley, T.E., C. Price, and P.J.B. Brown. Structural and lexical features of successive versions of the Read Codes. in proceedings of the 1996 Annual Conference of the Primary Health Care Specialist Group of the British Computer Society. 1996. Cambridge: UK.

15. Scottish Clinical Information Management in Primary Care. Read Codes "Making IT Work For You" Good Practice Guide (GPG) : RCGP and CEPpc. 2003. Available from: http://www.scimp.scot.nhs.uk/gpg/doc_page67.shtml [Accessed 25 April 2011]

16. Vallance, P. and T.G. Smart. The future of pharmacology. British Journal of Pharmacology, 2006. 147(S1): p. S304-S307.

17. Griffiths, W.M., et al., An introduction to genetic analysis. 2000, New York, USA: WH Freeman and Company.

18. U.S. Department of Energy Genome Program's Biological and Environmental Research Information System (BERIS). Pharmacogenomics. 08 September 2010. Available from: http://www.ornl.gov/sci/techresources/Human_Genome/medicine/pharma.shtml [Accessed 25 April 2011]


19. Centre for Genetics Education. Pharmacogenetics/Pharmacogenomics. June 2007. Available from: http://www.genetics.edu.au/factsheet/fs25 [Accessed April 2011]

20. Zhang, W., R. Huang, and M. Dolan. Integrating epigenomics into pharmacogenomic studies. Pharmacogenomics and personalized medicine, 2008(1): p. 7-14.

21. Barone, C., S.S. Mousa, and S.A. Mousa. Pharmacogenomics in cardiovascular disorders: Steps in approaching personalized medicine in cardiovascular medicine. Pharmacogenomics and personalized medicine, 2009. 2: p. 59-67.

22. Ginsburg, G.S., M.P. Donahue, and L.K. Newby. Prospects for Personalized Cardiovascular Medicine: The Impact of Genomics. Journal of the American College of Cardiology, 2005. 46(9): p. 1615-1627.

23. Cimiano, P., A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using formal concept analysis. J. Artif. Int. Res., 2005. 24(1): p. 305-339.

24. Muller, C., I. Gurevych, and M. Muhlhauser, Integrating Semantic Knowledge into Text Similarity and Information Retrieval, in Proceedings of the International Conference on Semantic Computing. 2007, IEEE Computer Society. p. 257-264.

25. Lord, P., R. Stevens, and A. Brass. Semantic similarity measures as tools for exploring the gene ontology. in The 8th Pacific Symposium on Bio-computing 2003. 2003.

26. Lord, P., et al. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 2003. 19(10): p. 1275 - 83.

27. Couto, F.M., M.J. Silva, and P.M. Coutinho. Measuring semantic similarity between Gene Ontology terms. Data & Knowledge Engineering, 2007. 61(1): p. 137-152.

28. Pesquita, C., et al. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics, 2008. 9(Suppl 5): p. S4.

29. Fellbaum, C., Wordnet: an electronic lexical database. 1998, Cambridge: MIT Press.

30. Li, Y., Z.A. Bandar, and D. McLean. An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Trans. on Knowl. and Data Eng., 2003. 15(4): p. 871-882.

31. Varelas, G., et al., Semantic similarity methods in wordNet and their application to information retrieval on the web, in Proceedings of the 7th annual ACM international workshop on Web information and data management. 2005, ACM: Bremen, Germany. p. 10-16.

32. Wu, Z. and M. Palmer, Verbs semantics and lexical selection, in Proceedings of the 32nd annual meeting on Association for Computational Linguistics. 1994, Association for Computational Linguistics: Las Cruces, New Mexico.

33. Leacock, C. and M. Chodorow. Combining Local Context and WordNet Similarity for Word Sense Identification. WordNet: A Lexical Reference System and its Application, 1998: p. 265-283.

34. Resnik, P. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 1999. 11: p. 95 - 130.


35. Songmei, C. and L. Zhao, An Improved Semantic Similarity Measure for Word Pairs, in Proceedings of the 2010 International Conference on e-Education, e-Business, e-Management and e-Learning. 2010, IEEE Computer Society.

36. Jiang, J.J. and D.W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. in International Conference Research on Computational Linguistics (ROCLING X). 1997.

37. Lin, D. An Information-Theoretic Definition of Similarity. in Proc. International Conference on Machine Learning (ICML). 1998.

38. Othman, R.M., S. Deris, and R.M. Illias. A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences. Journal of Biomedical Informatics, 2008. 41(1): p. 65-81.

39. Wang, J.Z., et al. A new method to measure the semantic similarity of GO term. Bioinformatics, 2007. 23(10): p. 1274 - 1281.

40. Schlicker, A., et al. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics, 2006. 7: p. 302.

41. Pesquita, C., et al. Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol, 2009. 5(7): p. e1000443.

42. Cox, T. and M. Cox, Multidimensional Scaling, Second Edition. 2000: Chapman & Hall/CRC.

43. Borg, I. and P.J.F. Groenen, Modern multidimensional scaling: Theory and applications. 2005: Springer.

44. Shlens, J. A tutorial on principal component analysis. Systems Neurobiology Laboratory, University of California at San Diego, 2005.

45. Abdi, H. and L.J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2010. 2(4): p. 433-459.

46. Smith, L.I. A tutorial on principal components analysis. Cornell University, USA, 2002. 52.

47. Sadée, W. and Z. Dai. Pharmacogenetics/genomics and personalized medicine. Human Molecular Genetics, 2005. 14(suppl 2): p. R207-R214.

48. Couto, F., M. Silva, and P. Coutinho. Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. In CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, 2005: p. 343 - 344.

49. Pedersen, T., S. Patwardhan, and J. Michelizzi, WordNet::Similarity: measuring the relatedness of concepts, in Demonstration Papers at HLT-NAACL 2004. 2004, Association for Computational Linguistics: Boston, Massachusetts. p. 38-41.

50. Jolliffe, I.T., Principal component analysis. Vol. 2. 2002: Springer Series in Statistics. 487.


Appendix A – Project Gantt chart