abstractdnahelix.wdfiles.com/local--files/phd-research-proposal… · web view2007/07/18  ·...




Thesis Proposal (Warren Cheung) Page 1 of 17document.doc

To think about:
- Improve figures
- Coordinate terminology (relationship/association/link)
- Other evidence sources:
    Gene -> PubMed Articles -> Disease
    Gene -> Orthologous Gene -> PubMed Articles -> Disease
    Gene -> Interacting/Regulated Gene -> PubMed Articles -> Disease
    Gene -> Structural Element -> PubMed Articles
    Gene -> Function -> Related Genes -> PubMed Articles -> Disease
- Better word choice? Redefine Evidence as Property? Entity == ? Linkage/Association/Relationship/…

Extraction and Evaluation of Transcription Factor Gene-Disease Association

Thesis Proposal for Doctor of Philosophy

Warren Cheung

Supervised by
Francis Ouellette
Wyeth Wasserman


Table of Contents

A. Problem Statement
    Summary of Goals
    Motivation
    Example Use Cases
    Existing Methods
B. Proposed Method
    Genes
    Disease
    Features
    Linkages
    Quantitative Evaluation
    Validation
C. Goals
    1) Main TF Gene-Disease Association Prediction Model
    2) TF Gene-Disease Association Property Predictions
    3) Gene Cluster-Disease Association Predictions
    Common Goals
D. Project
    Principles
E. Appendix I - Data Sources
    Genes
    Disease
    Evidence
    Prototype Implementation
F. References

A. Problem Statement

The purpose of this research is to identify effective methods for quantitatively evaluating the relationship between transcription factor genes and diseases via literature evidence, identifying existing associations and predicting novel ones. To accomplish this, I shall explore ways to link various forms of evidence with genes and diseases, develop quantitative methods to evaluate the resulting associations, and validate the resulting analyses.

Summary of Goals

1. Main TF Gene-Disease Association Prediction Model
Evaluate quantitatively the association between each transcription factor gene and each disease.

2. TF Gene-Disease Association Property Predictions
Evaluate quantitatively what properties are relevant to a gene-disease pairing.

3. Gene Cluster-Disease Association Predictions
Identify clusters of similar genes associated to disease.

Motivation

Transcription factors are regulators of gene expression, acting via the recruitment of other transcription initiation factors as well as by causing DNA conformational change. They can also act as part of protein complexes. Brain diseases are a broad disease area, encompassing a wide range of complex, abnormal phenotypes, including combinations of lethality, neurodegeneration, paralysis and behavioural abnormality. Many diseases are not very well understood or well characterised, and many have complex genetic components involving multiple genes.

Transcription factors in particular play a key role in the brain. Given the incredible diversity of the neuronal and glial cells and their complex arrangement, the careful balance of transcription factors is vital to the proper development of the brain, determination of the cell subtype and migration. This relationship continues through to the adult brain, where transcription factor activity is linked to neuronal survival, differentiation, proper cellular function and neuroplasticity.

Existing databases and analyses can be leveraged as sources of information for this research. For example, databases such as JASPAR, ORegAnno, TF-Cat and PAZAR can provide information on transcription factors, via evidence of transcription factor activity (TF-Cat) as well as interaction with DNA binding sites (PAZAR). Data from the Pleiades project, which studies region-specific promoters in the mouse brain and links gene promoter elements with expression in specific brain regions, could potentially be used to validate results from this thesis.

Example Use Cases

A researcher performs a microarray experiment, comparing expression of genes from tissue in individuals with and without symptoms of a neurological disease. From the set of genes showing differential expression between these two conditions, the researcher is looking for the set of genes most likely to be involved in diseases of interest and supporting evidence for such relationships.

A researcher wishes to get a ranked list of known and candidate genes and properties relevant to a particular disease, each with a list of supporting evidence, highlighting potential pathways and regulatory relationships between these genes.

Existing Methods

Existing methods concentrate on analyzing sets of candidate genes and either reducing or ranking the genes in the set. Methods use a variety of input data sources, from numerical features derived from the raw DNA and protein sequences and annotations of proteins and genes, to text mining of PubMed abstracts and OMIM articles. The current methods focus on using properties from a representative set of genes to identify similar genes from the candidate set. A collection of these methods was applied together towards identifying genes responsible for diabetes and obesity (Tiffin, Adie et al. 2006). CAESAR (Gaulton, Mohlke et al. 2007), Endeavor (Aerts, Lambrechts et al. 2006) and an update to G2D (Perez-Iratxeta, Bork et al. 2007) are more recent developments in this field.

[Summary Table of Methods used by the related work]

One method for identifying disease-related genes involved clustering the diseases in OMIM, rather than the disease genes, using indices such as primary tissue involved, age of onset, primary etiology, episodic occurrence and mode of inheritance. Similarity between two diseases is the weighted contribution of each of these indices. Once the clusters are determined (using a strategy that involves manual thresholding by a human expert), the candidate genes are compared to the disease genes underlying the diseases in each cluster using the annotations from GOA. The score for a candidate gene for a disease cluster is the average, over all GO terms, of the ratio between the occurrences of the GO term in the cluster (counted only if the term annotates the candidate gene, otherwise 0) and the occurrences of the GO term in all disease genes. This score is then downscaled by the number of genes in the cluster. The authors validate their results using leave-one-out cross-validation.

One method to tackle the general problem of identifying pertinent genes is to narrow the relevant genes via specific constraints, with the output being results that satisfy some or all these constraints. GeneSeeker(Van Driel, Cuelenaere et al. 2005) can find genes within a chromosomal location that are localized in particular tissues, by looking at human and mouse expression data. Another method of associating disease genes to anatomical locations(Tiffin, Kelso et al. 2005) performed text mining of PubMed abstracts to associate eVOC anatomical ontology terms to disease gene names.

Another method is to treat the problem as a machine learning problem, and use the representative set of genes as training data. In DGP (López-Bigas and Ouzounis 2004), this technique is used to find features common to disease genes in general, using a decision tree classifier trained on sample disease and control proteins. Features were protein length, as well as BLASTP ratios (conservation score) between a protein and its highest scoring homologue within taxonomic groups (representing phylogenetic conservation and extent) and the conservation score with the closest paralogue. Their analysis indicates that, on average, hereditary disease genes (genes taken from OMIM) are longer, more conserved, phylogenetically extended and without close paralogues.

PROSPECTR (Adie, Adams et al. 2005) uses a wider variety of features, including the length of the gene, the length of its coding sequence, the length of its cDNA, the length of the protein, GC content and percentage protein identity with its nearest homologue in various species (mouse, worm, fly). They use an alternating decision tree, again taking genes from OMIM and comparing against genes not found in OMIM. They also generated two independent test sets: one using genes from the Human Gene Mutation Database with randomly selected control genes, and another of 54 genes not in OMIM but known to be involved in oligogenic disorders, again with a set of randomly selected control genes.

POCUS (Turner, Clutterbuck et al. 2003) takes another approach to identifying disease-related genes. The input in this case is not all disease-related genes, but rather a selected training set of genes (from differing susceptibility regions) that are representative of the disease in question. POCUS then looks for features shared between the training genes (InterPro domains, GO annotations, similar expression profiles) and compares them against the probability that these shared features would occur by chance. This method assumes that genes related to the disease are more likely to share functional annotation than would be expected by chance.

G2D(Perez-Iratxeta, Bork et al. 2002) links genes from a specified genomic locus to diseases by examining PubMed MeSH disease and chemical term annotation and RefSeq GO annotations. MeSH disease terms were mapped to MeSH chemical terms via co-occurring annotation of PubMed articles. RefSeq GO annotations were linked to the MeSH chemical terms via the PubMed references in the GO annotations. Fuzzy set relation scores were generated for these pairwise associations as the ratio of the cardinality of the intersection against the union. The score for the combined disease-chemical-gene relation is defined as the product of the two pairwise relations, and the score for a disease-gene relation is simply the maximum of all possible scores. Recently, it has been developed into a web server(Perez-Iratxeta, Wjst et al. 2005), and the most recent update(Perez-Iratxeta, Bork et al. 2007) includes several other methods of inferring disease-gene associations, involving the user providing genes from other genomic regions related to the disease. The first method is more stringent – it looks for disease genes sharing functional similarity with the specified genes. The second method looks for functional association via protein-protein interactions (provided by the STRING database).
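The G2D scoring described above can be illustrated with a minimal sketch. It assumes each term (disease, chemical, or gene annotation) has already been reduced to the set of PubMed IDs it is attached to; the function names, the toy IDs and the term names are hypothetical and are not taken from G2D itself.

```python
def relation_score(a: set, b: set) -> float:
    """Ratio of the intersection size to the union size (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disease_gene_score(disease_pmids: set, chemical_pmids: dict, gene_pmids: set) -> float:
    """Best disease-chemical-gene score over all chemical terms.

    chemical_pmids maps each chemical term to the set of article IDs annotated with it.
    Each chain is scored as the product of its two pairwise relation scores,
    and the disease-gene score is the maximum over all chains.
    """
    return max(
        (relation_score(disease_pmids, pmids) * relation_score(pmids, gene_pmids)
         for pmids in chemical_pmids.values()),
        default=0.0,
    )


# Toy article IDs, for illustration only.
disease = {1, 2, 3, 4}
chemicals = {"chemical_a": {3, 4, 5}, "chemical_b": {9}}
gene = {4, 5, 6}
score = disease_gene_score(disease, chemicals, gene)  # (2/5) * (2/4) via chemical_a
```

Taking the maximum means a single strong chain is enough to link a gene to a disease, which matches the description above but ignores how many independent chains support the link.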

The Endeavor system (Aerts, Lambrechts et al. 2006) aims to create an extensible system for prioritizing disease genes using heterogeneous data sources. The input to the system is a training set of genes. They evaluated the performance of the system against monogenic diseases (automatically extracted from OMIM), polygenic diseases (six genes recently determined to be involved in polygenic disease) and also for functional role in regulatory pathways (by examining differential RT-PCR expression). They also performed functional validation in zebrafish, searching for genes involved in DiGeorge syndrome (DGS) by using a training set of genes causing DGS and DGS-like symptoms. This resulted in the prioritization of both TBX1, a known DGS-related gene, and YPEL1, which resulted in DGS-like defects when its expression was knocked down in vivo.

More recently, CAESAR (Gaulton, Mohlke et al. 2007) takes a representative input text on the disease and uses text mining to determine relevant ontology terms. For each data source (including GOA, InterPro and protein-protein interaction databases), genes are ranked based on the annotated ontology terms. The gene ranks are then integrated, using the functions sum, mean and maximum, as well as a transformed score that considers both the rank of a gene for each data source and the number of genes returned by that data source.

B. Proposed Method

We propose a method that extracts gene-disease associations, emphasizing verifiable supporting evidence for the predicted associations and a quantitative evaluation of the strength of the association. We shall investigate both associations between genes and diseases and properties of the gene-disease association.


[Figure: the three base entities, Genes, Evidence and Diseases]

We shall consider three base entities (Genes, Diseases and Evidence) and the relationships between these entities. Our goal will be to predict Gene-Disease relationships based on the existence of relationships within and between the entities, creating paths between the Gene and Disease entities. Starting with the relationship of shared evidence between a gene and a disease, we will also consider less direct relationships, such as orthologous genes in other species and related diseases.

These paths of supporting evidence will be quantitatively evaluated, making it possible to both extract strongly supported gene-disease linkages and to rank these linkages.

Although the thesis itself will investigate properties of transcription factor genes in diseases, the methods and analysis will be designed for general application. For the initial analysis of the main gene-disease associations, we shall investigate brain diseases specifically. Once we reach the stage of mining property associations and analysis of clusters of genes, we shall select a second disease area to look at to both allow more variety in analyses and demonstrate the generality of the method.

Genes

We shall consider all genes in Entrez Gene as our primary source for genes, mapping references to genes, DNA, RNA and protein products as needed. We shall identify transcription factors via GOA, supplemented by genes in a TF-specific database, TF-Cat. The Ensembl Gene set may be mapped to, combined with, or used as an alternative to Entrez Gene.

Disease

We shall use the disease terms in the MeSH ontology as a primary source for disease terms. Other vocabularies and ontologies, such as the UMLS Metathesaurus concepts, Disease Ontology, ICD and SNOMED CT, may also be used in conjunction with or in place of the MeSH ontology.

Features

In general, features encompass all descriptive properties of genes and diseases, qualitative and quantitative. Qualitative features include ontology or vocabulary annotations, such as from GOA for genes and MeSH terms for PubMed articles, or may be free text, such as GeneRIFs. Quantitative features include numerical attributes, such as the length of coding sequence, as well as derived numerical attributes, such as BLAST similarity score to the nearest murine homologue.

Evidence

We shall consider PubMed articles as a primary source of supporting evidence. All other forms of experimental evidence (microarray data, gene linkage studies) will be mapped to the relevant PubMed article in order to be considered. Additionally, we can consider properties derived from primary data sources, such as length of the protein sequence from the NCBI and Ensembl sequence repositories.

Linkages

To find evidence associating transcription factors with diseases, we shall integrate and evaluate the strength of the links between genes, evidence and disease. This divides the linkages into five broad categories: Gene-Gene, Gene-Evidence, Evidence-Evidence, Evidence-Disease and Disease-Disease.

Gene-Gene relationships include homology and gene interactions. When considering a human transcription factor gene, information can be gleaned from paralogs (highly similar genes potentially arising from an ancestral gene duplication event) and from orthologs in a closely related species. Gene interaction includes protein-protein interactions as well as regulatory mechanisms, from interfering RNA to the transcriptional regulation effects at DNA binding sites of the transcription factors. These related genes are likely to share elements in common with the human TF under consideration. From the presumed evolutionary relationships, paralogs are likely to share function and orthologs are likely to perform the same role [REF basic ortholog = function], although recent evidence supports significant divergence between the mouse and human genome-wide transcription factor-DNA binding profiles (Odom, Dowell et al. 2007). Interaction partners and downstream regulated genes are likely to be involved in some common process. The gene-gene relationships will be extracted from curated sources such as Orthologene, and also computationally derived via commonly used gene similarity metrics such as BLAST E-values. Interaction databases such as BIND, IntAct and STRING will be used to extract other protein-protein relationships. The PAZAR database can also provide TF-gene regulation relationships.

Gene-Evidence relationships include gene references in PubMed articles: GeneRIFs, RefSeq Related Articles and Gene Ontology literature reference articles relevant to the gene of interest. As well, the transcription factor database TF-Cat links transcription factor genes with relevant articles. Many of the links describe the reason for the linkage, whether via plain text (GeneRIFs and TF-Cat) or ontology terms (GOA). [FIXME quantitative scores?]

Evidence can be linked by similarity as well as via citations. PubMed related articles links articles in PubMed by the similarity of their abstract text, as well as citations, both articles that cite and are cited by an article in question.

Disease-Evidence links are taken from references to a particular disease from an article. We shall use MeSH headings, annotated by NLM curators on PubMed articles, to link evidence to disease.


Disease-Disease linkages can be gleaned from ontological relationships, and from hierarchical arrangements in organized vocabularies. The MeSH hierarchy will be used to determine relationships between disease entities.

Quantitative Evaluation

Scoring Relationships

To evaluate the results obtained, we shall aim to generate relevant and intuitive numerical scoring methods. Our goal is for the scoring methods to be sufficiently general to allow evaluation of, and comparison between, varied forms of evidence and methods.

To evaluate the strength of a linkage between two entities (e.g. a TF gene and a disease) supported by evidence (e.g. from a subset of all PubMed articles), we consider a null hypothesis: that the linkage found occurred entirely by chance. We can therefore examine the probability of the evidence found occurring by chance. In the example, we consider the n PubMed articles that are referenced by the gene, and the x of those articles which are annotated as linked to the disease. We then compare against the K articles in total that are annotated as linked to the disease, and the N articles in PubMed. If we consider each article referenced by the gene as a random draw, without replacement, from the pool of all articles available (the subset of all PubMed articles), we can use a hypergeometric distribution to model the number of articles we would see by chance annotated as linked to the disease and quantitatively evaluate our results. Therefore, if we observe that x articles referenced by the gene are associated to the disease, the probability of observing at least x such articles by chance is

$$P(X \ge x) = \sum_{i=x}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$

These results equate to performing a one-tailed Fisher’s exact test. Should this prove too computationally expensive or inaccurate to compute, we can approximate this using the binomial distribution, if n is much smaller than N-K and K.
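As a minimal sketch of the test described above, the upper tail of the hypergeometric distribution can be computed exactly with integer binomial coefficients. The function name is hypothetical; the symbols x, n, K and N follow the definitions in the preceding paragraph, and the numbers in the example are toy values rather than real PubMed counts.

```python
from math import comb


def hypergeom_upper_tail(x: int, n: int, K: int, N: int) -> float:
    """P(X >= x) when n articles are drawn without replacement from N articles,
    K of which are annotated as linked to the disease."""
    upper = min(n, K)
    # comb() returns 0 when the lower index exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return favourable / comb(N, n)


# Toy example: 10 articles in total, 4 linked to the disease,
# 3 referenced by the gene, 2 of which are linked to the disease.
p_value = hypergeom_upper_tail(x=2, n=3, K=4, N=10)  # (36 + 4) / 120 = 1/3
```

Exact integer arithmetic avoids rounding problems, at the cost of very large intermediate integers when N is in the millions; that cost is one reason an approximation may be preferred in practice.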

Multiple Testing Correction

Multiple testing correction will be employed in cases where we examine the potential association between a gene and each of the diseases, for example when the investigator specifies a particular gene and requests a list of all diseases associated with the gene. The danger in such a case is potentially increased Type I (false positive) error. Here we can employ the Bonferroni (familywise error) correction: effectively, we divide the significance level we are looking for (e.g. α = 0.05) by the number of tests we employed (e.g. the number of diseases we tested for) and count as significant the p-values that fall below this conservative threshold.

When considering many tests, the penalty imposed by the Bonferroni correction may prove too extreme, resulting in a substantial increase in Type II (false negative) error. As an alternative, we could employ the Benjamini-Hochberg (false discovery rate) correction to control the Type I error explicitly. In this case, rather than controlling for a single erroneous rejection of the null hypothesis, we control the expected fraction of erroneous rejections. This method has been shown to be applicable when the tests are independent and when the tests are positively correlated, and has been used for correction of GO term overrepresentation.
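The two corrections discussed above can be compared with a short sketch. The function names and the example p-values are hypothetical; the Benjamini-Hochberg part implements the standard step-up procedure, which is assumed here to be the variant intended.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject a test when its p-value is at most alpha divided by the number of tests."""
    if not pvalues:
        return []
    threshold = alpha / len(pvalues)
    return [p <= threshold for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure: find the largest rank r with p_(r) <= alpha * r / m,
    then reject the r smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


# Toy p-values for one gene tested against four diseases.
pvals = [0.001, 0.02, 0.03, 0.2]
strict = bonferroni(pvals)           # [True, False, False, False]
relaxed = benjamini_hochberg(pvals)  # [True, True, True, False]
```

On these toy values the Bonferroni threshold of 0.0125 keeps only the strongest association, while the false discovery rate procedure keeps three, which illustrates the trade-off between Type I and Type II error described above.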

Joint Probability

In general, we can utilize the overrepresentation analysis to determine when two

When considering two links, linking gene A to feature B and feature B to disease C, with p-values p(B|A) and p(C|B) (abbreviated p(AB) and p(BC)), we often wish to estimate the probability of the secondary relation, p(C|A). We assume that the relation A->B->C is transitive, and that p(B|A) and p(C|B) are independent. Treating each p-value as the probability that its link arose by chance, the combined link is genuine only when both individual links are genuine, which happens with probability (1 - p(AB))(1 - p(BC)). The probability that the combined link arose by chance is then 1 - (1 - p(AB))(1 - p(BC)) = p(AB) + p(BC) - p(AB)p(BC).

Another heuristic, useable when we wish to examine multiple links, is the shortest path heuristic(Zhou, Kao et al. 2002). Each link becomes the weight of an edge in a graph, and the length of the shortest path between two points is the value given for that relation.
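The shortest path heuristic can be sketched with Dijkstra's algorithm. This is an illustration only: the text does not specify how link scores are converted to edge weights, so the sketch assumes non-negative, distance-like weights where a smaller value means a stronger link. The graph layout, node names and weights are hypothetical.

```python
import heapq


def shortest_path_length(graph, source, target):
    """Length of the shortest directed path from source to target.

    graph maps each node to a dict of {neighbour: non-negative edge weight}.
    Returns float('inf') when target cannot be reached.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")


# Toy graph: a gene reaches a disease directly through one article,
# or indirectly through an ortholog and a second article.
links = {
    "gene": {"article1": 0.5, "ortholog": 0.25},
    "ortholog": {"article2": 0.25},
    "article1": {"disease": 1.0},
    "article2": {"disease": 0.25},
}
value = shortest_path_length(links, "gene", "disease")  # 0.75, via the ortholog
```

In this toy graph the indirect route through the ortholog wins over the direct one, showing how the heuristic lets strong indirect evidence outrank weak direct evidence.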

Validation

Validation of the data will be performed in three ways: using OMIM gene and disease entries, comparison with more recent data, and manual verification. This will test the sensitivity of our method, by providing positive examples. As it is impossible to rule out a future link between a gene and disease, there is no negative data.

To evaluate the basic accuracy of the relationship links suggested by the system, OMIM entries noting a link between a gene and a disease will comprise one set of positive data Y, to be compared against the results generated by our system X. By taking the ratio $|X \cap Y| / |Y|$, we can evaluate the sensitivity of our method: the fraction of the positive examples that are correctly identified by the system. This will be the evaluation of the predictive capability of the system. Doing this using the most recent versions of the databases would evaluate the ability of the system to reconstruct the known relationships in OMIM. Similarly, we can manually evaluate the results of the system using the associated evidence, such as determining whether the PubMed articles referenced support the gene-disease association hypothesis.
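The sensitivity calculation above amounts to a set comparison. The sketch below assumes that both the system output X and the positive set Y are represented as sets of (gene, disease) pairs; the function name and the placeholder pairs are hypothetical and do not come from OMIM.

```python
def sensitivity(predicted: set, known_positive: set) -> float:
    """Fraction of known positive gene-disease pairs that the system also reports."""
    if not known_positive:
        raise ValueError("sensitivity is undefined without positive examples")
    return len(predicted & known_positive) / len(known_positive)


# Placeholder pairs, for illustration only.
system_output = {("GENE_A", "DISEASE_1"), ("GENE_B", "DISEASE_1"),
                 ("GENE_C", "DISEASE_2"), ("GENE_D", "DISEASE_3")}
positive_set = {("GENE_A", "DISEASE_1"), ("GENE_B", "DISEASE_1"),
                ("GENE_C", "DISEASE_2"), ("GENE_E", "DISEASE_2")}
recovered = sensitivity(system_output, positive_set)  # 3 of 4 known pairs recovered
```

Pairs reported by the system but absent from the positive set, such as the fourth placeholder pair, do not lower the score; with no negative data they cannot be counted as errors.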

To look at the predictive ability of the system, we can also freeze the databases loaded by the system before a particular date and use these slightly obsolete databases for the analysis. We can then curate the more recent literature since the frozen time-point for novel gene-disease linkage discoveries which will provide a second set of positive data. Methods to generate this curated dataset would be to look for new OMIM disease and gene entries and manual curation. Manual curation can be assisted by the system by generating relationships and verifying the evidence manually.

OMIM is a fairly conservative source of gene and disease information, and therefore will not necessarily have all the most recent discoveries curated. One method to more accurately place the time

We can also manually verify the evidence supplied by the system for a particular gene-disease linkage. By examining the PubMed articles referenced, we can evaluate whether each is relevant to the gene, the disease, both or neither. This form of verification will evaluate the relevance of the data extracted by the system.


C. Goals

1) Main TF Gene-Disease Association Prediction Model

Associations between genes and diseases will be identified, implicating specific genes in specific diseases, each with a quantitative strength.
- Tool to derive associations between genes and diseases from the database
- Model to quantitatively evaluate associations extracted
- Validation of the associations derived and the model

2) TF Gene-Disease Association Property Predictions

Associations between genes and diseases will be expanded to elucidate additional properties, such as the functional role of the gene in the disease and the affected locations, and to investigate the relationships between genes involved in the disease, such as via protein-protein interactions and transcriptional regulation.
- Tool to analyse gene-property-disease associations
- Model to quantitatively evaluate the properties derived
- Validation of the additional properties derived and the model

3) Gene Cluster-Disease Association Predictions

Meta-analysis, using the results from previous association analyses, will focus on finding clusters of genes related to disease. Using traditional methods, such as k-means, as well as more recently developed methods such as (Jochen’s rank-based prior) and OPTICS, we shall investigate whether the genes can be clustered in ways that are meaningful with respect to diseases and disease properties.
- Cluster genes, looking for disease and disease-property clusters
- Validation by examining known disease-related genes and disease genes involved in pathways

Common Goals

Data on genes, diseases and evidence used to support the gene-disease associations will be extracted and stored, to support analysis and validation.
- Database of transcription factor genes, diseases and evidence data
- Tool to create and update the database from relevant data sources

D. Project

Principles

Quantitative TF Gene-Disease Relationships

The tools will allow examination of known and predicted gene-disease relationships, and will quantitatively evaluate these relationships. The evidence supporting predictions will be accessible, allowing users direct means to confirm the predictions. The system will be designed to accommodate more general use in other disease areas or types of genes.

Open Access

Freely available data sources will be used. The tools developed and results of the analyses will be made publicly available and published in open access journals.

Modular, Efficient Programmatic Framework

A comprehensive toolkit for analysis will be developed. Scalable algorithms will be used to handle the extremely large, expanding datasets involved. Efficient methods to extract data from the large dataset will be developed.

E. Appendix I - Data Sources

The system will be designed to provide a complete storage solution for genes, diseases and evidence from disparate databases, as well as existing and computed annotations and relationships. A consistent interface will allow straightforward access to all the data. This data will be stored in a database, with programs written to both load and update from the data sources.

Initially, we can separate our concerns into three areas — genes, diseases and evidence. Genes refer to loci on the chromosomes of humans, generally protein-coding, including the relevant regulatory elements. Diseases refer to abnormal human phenotypes. Evidence refers to all the data that will be used to link genes to disease.

Due to the extreme sizes of the data sources involved (16 million entries in PubMed alone), we shall consolidate the data in a local database. This will ensure maximal efficiency for accessing the data when performing the analyses. It will also put all the data in a common, controlled format, which will simplify the downstream analyses and make the development of the subsequent tasks independent of the data acquisition task.

Genes

The ultimate goal of the thesis is to link human transcription factor genes with human diseases. However, genes in other organisms, especially the closely related and well-studied mouse models, as well as other genes, such as genes regulated by transcription factors, will need to be considered. As well, in existing methods, candidate genes may be specified directly by the user or selected via broad chromosomal regions. To accommodate the range of genes that may be used in our analyses, I shall use Entrez Gene as the primary source reference for genes.

Entrez Gene

This NCBI database tracks genes annotated in genomes, from known genes to protein-coding regions (e.g. in viruses) and predicted genes. A unique gene identifier is assigned for each gene in each species. Data in Entrez Gene comes from both curated and automatically generated sources. This includes information from and links to sequences in NCBI Reference Sequence (RefSeq). Gene Ontology (GO) annotations are provided by the Gene Ontology Annotation (GOA) Database. Data from Entrez-accessible sources at the NCBI can be accessed via NCBI Entrez EUtils, as well as downloaded via FTP as compressed text files.
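As a rough illustration of programmatic access through Entrez EUtils, the sketch below builds (but does not send) an ESearch request URL against the `gene` database. The helper name `esearch_url` and the example search term are illustrative assumptions, not part of the proposal; any real use should follow the NCBI usage guidelines.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """Build an ESearch request URL; no network request is made here."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"

# Illustrative query only: look up one human gene by name.
url = esearch_url("gene", "PAX6[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

Fetching the URL returns an XML list of matching Entrez Gene identifiers, which could then be loaded into the local database.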

Gene Ontology

GO is a collaborative effort to provide a consistent nomenclature for gene annotations and for indicating the strength of the evidence supporting such annotations. In addition to the three original members, the model organism databases FlyBase, the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD), there are now over ten full members, including GOA, and several associate members. GO is composed of three main ontologies: biological processes, cellular components and molecular functions. Annotations are qualified by a controlled vocabulary of evidence codes, most of them three letters long, ranging from inferred from electronic annotation (IEA) to traceable author statement (TAS). However, GO does not describe "abnormal" features, such as mutant or disease-specific traits. The Gene Ontology Annotation Database is responsible for annotations to proteins in the human, chicken and cow genomes in UniProtKB, and is supplemented by annotations from other groups. Priority is given to proteins without annotation, those with disease relevance and those relevant to high-throughput analyses. We use the GO term “transcription factor” to identify genes that are transcription factors.

Statistics

[Figure: gene counts, plotted on a logarithmic scale]

Human TF Genes: 1209
TF Genes: 7866
Human Genes: 38624
All Genes: 2631524

Other sources for Transcription Factors

I shall also examine the integration of other data sources, both to increase the coverage of transcription factors and to provide more direct links to the literature. Curated TF databases, such as the locally developed TF-Cat database, can provide a specialised, annotated resource for transcription factors.

Disease

No standard ontology or vocabulary for diseases is currently in widespread use. However, several standards for categorizing data in closely related fields exist: Medical Subject Headings (MeSH), used to annotate PubMed articles; the International Classification of Diseases (ICD), the standard terminology used worldwide to track morbidity; and SNOMED CT, an emerging standard for health records. The Unified Medical Language System Metathesaurus and the Disease Ontology will provide methods of unifying these terminologies. [GALEN project?]

[licensing issues?]

Medical Subject Headings (MeSH)

MeSH is a controlled vocabulary thesaurus of descriptors, arranged in a hierarchical structure. Sixteen main categories (e.g. Anatomy, Disease) at the top are divided into subcategories, and the descriptors are then placed into the tree, running from more general terms near the top to the most specific at the bottom; a descriptor may occur more than once in the tree. We shall initially use category C, in particular tree number C10.228.140, "Brain Diseases", and its subheadings, as labels for disease. However, as MeSH is a general subject classification system, disease labels will often be general rather than specific. For example, “Spinocerebellar ataxias” (SCA) exists as a distinct MeSH term, but the specific SCA types do not.
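Selecting the terms at or below "Brain Diseases" amounts to a prefix test on tree numbers. The sketch below shows one way this could be done; the helper names are hypothetical, and every tree number other than C10.228.140 is a made-up placeholder rather than a real MeSH assignment.

```python
BRAIN_DISEASES = "C10.228.140"

def is_under(tree_number, ancestor=BRAIN_DISEASES):
    """True if tree_number is the ancestor itself or lies below it in the tree."""
    # Appending "." stops C10.228.1400 from matching C10.228.140 by accident.
    return tree_number == ancestor or tree_number.startswith(ancestor + ".")

def terms_under(term_to_tree_numbers, ancestor=BRAIN_DISEASES):
    """Names of descriptors with at least one tree number under the ancestor."""
    return sorted(
        term
        for term, numbers in term_to_tree_numbers.items()
        if any(is_under(n, ancestor) for n in numbers)
    )

# A descriptor can appear at several places in the tree, hence the lists.
example = {
    "Brain Diseases": ["C10.228.140"],
    "Hypothetical child term": ["C10.228.140.001", "C99.001"],
    "Hypothetical unrelated term": ["C10.228.1400"],
}
print(terms_under(example))
```

Because a descriptor may occur more than once in the tree, a term counts as a brain disease label if any one of its tree numbers falls under the chosen branch.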

International Classification of Diseases (ICD)

Also known as the International Statistical Classification of Diseases and Related Health Problems, this classification system, published by the World Health Organisation, provides codes to classify diseases, as well as signs of health problems such as symptoms, social circumstances and external causes of injury. It is currently in its 10th revision, ICD-10, and is used to track mortality statistics worldwide. ICD-10-CA is an enhanced version developed by the Canadian Institute for Health Information for morbidity classification, and was phased in from 2001 to 2006. ICD-9-CM, based on the 9th ICD release, is the current official standard used by U.S. hospitals. Incorporating ICD variants would therefore allow interoperability with publicly available morbidity data.

Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT)

SNOMED CT was originally developed by the College of American Pathologists. As of April 2007, it is owned by the International Health Terminology Standards Development Organisation (IHTSDO). Canada is a founding member country of IHTSDO, and is represented by Canada Health Infoway, an organization aimed at providing interoperable electronic health record solutions for Canadians. [EXISTING USES?]

The Unified Medical Language System (UMLS) Metathesaurus

The UMLS Metathesaurus is a database of medical terminology, provided by the National Library of Medicine. It provides a mapping to unique concept identifiers from vocabularies including MeSH, ICD and SNOMED CT.

Disease Ontology

This controlled vocabulary, currently under development at the Center for Genetic Medicine, Northwestern University, aims to facilitate mapping diseases and conditions, and uses the UMLS to map terminologies such as SNOMED and ICD into the Open Biomedical Ontology format. The previous stable release version of the ontology was based primarily on ICD-9-CM.

(Online) Mendelian Inheritance in Man (OMIM)

OMIM provides access to curated reports in human-readable text format on both genes implicated in diseases and diseases with a genetic component. Articles include inline PubMed references as supporting evidence. OMIM has been used [REFS] as a source for genetic diseases; however, this would only provide a list of known (potentially) genetic diseases, leaving out diseases that do not yet have a known genetic component.

Evidence

As we are focusing research on verifiable experimental evidence, our sources of data will include scientific articles summarizing the results of experiments in addition to other databases of experimental results and the results of simple analyses. PubMed will provide a basic source of scientific articles.

PubMed

PubMed is a searchable citation database at the NCBI, indexing biomedical literature. Bibliographic citation information is taken primarily from the National Library of Medicine (NLM) MEDLINE database, although some journals indexed for their biomedical articles have all of their articles indexed. There are also legacy articles from OLDMEDLINE, as well as records from other initiatives that experimented with indexing other scientific literature.

GeneRIF

Gene Reference Into Function (GeneRIF) entries are annotations, both submitted by the public and curated by the National Library of Medicine, describing references to gene function. Gene function in this case is defined very broadly, referring not only to biological function, but also to information about the gene's role in disease, as well as its discovery and mapping. In addition to general GeneRIFs, there are two other major sources of GeneRIFs: information from the HIV-1, Human Protein Interaction Database, and information from the protein-protein interaction databases BIND, BioGRID, EcoCyc and HPRD. All GeneRIFs include a reference to at least one PubMed article as evidence. GeneRIFs thus associate PubMed evidence with genes.

MeSH annotations

PubMed/MEDLINE entries are continually being indexed using MeSH terms by curators at the NLM. Each article is indexed by one or more MeSH terms, each of which may also have one of 83 topical qualifier subheadings (e.g. analysis, education or therapy) to potentially indicate a more specific topic.
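Since GeneRIFs tie genes to PubMed articles and MeSH annotations tie articles to disease terms, a gene-disease link can be counted through shared articles. The following is a minimal sketch of that idea only, not the proposal's scoring method; the function name, gene identifiers, PMIDs and annotations are all hypothetical placeholders.

```python
from collections import defaultdict

def gene_disease_counts(generif_rows, mesh_rows):
    """Count distinct articles supporting each (gene, MeSH term) pair.

    generif_rows: iterable of (entrez_gene_id, pmid) pairs
    mesh_rows:    iterable of (pmid, mesh_term) pairs
    """
    pmid_to_terms = defaultdict(set)
    for pmid, term in mesh_rows:
        pmid_to_terms[pmid].add(term)

    counts = defaultdict(int)
    seen = set()
    for gene_id, pmid in generif_rows:
        if (gene_id, pmid) in seen:  # count each gene-article pair only once
            continue
        seen.add((gene_id, pmid))
        for term in pmid_to_terms.get(pmid, ()):
            counts[(gene_id, term)] += 1
    return dict(counts)

# Placeholder data: two genes, two articles, repeated GeneRIF for gene 1001.
generif_rows = [(1001, 1), (1001, 2), (1002, 2), (1001, 1)]
mesh_rows = [(1, "Brain Diseases"), (2, "Brain Diseases"), (2, "Epilepsy")]
counts = gene_disease_counts(generif_rows, mesh_rows)
print(counts)
```

Raw co-occurrence counts like these would still need normalising against how often each gene and each term appears overall before they could serve as a quantitative measure.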

Statistics

Total Genes with GeneRIFs: 33216
Human TF Genes with GeneRIF: 914
PubMed articles: 16,120,074
PubMed articles with MeSH headings: 15,806,221
PubMed articles with Brain Disease MeSH (or more specific) terms: 660538
MeSH terms: 47143
Unique MeSH terms: 24355
MeSH terms under Brain Diseases: 312

Other forms of Evidence

The PAZAR database identifies regulatory elements associated with genes, potentially revealing interconnected regulatory programs. The STRING database incorporates both experimental protein-protein interaction evidence and predicted interactions. The KEGG database provides pathways.

Experimental evidence has also been incorporated in programs such as GeneSeeker and POCUS. Other annotations used include eVoc annotations (Tiffin), InterPro domains, secondary properties derived from DNA or protein sequences (DGP, PROSPECTR). As well, some programs use text mining, extracting information from textual sources such as PubMed abstracts and OMIM articles.

Prototype Implementation

The data will be stored in a relational database. Each of the major concepts (genes, diseases and evidence) will be represented as an abstract entity. A specific instance of an entity will be a member of the abstract entity and will also store its source-specific information separately. For efficiency, the abstract entities will contain only the data relevant to the analyses; specific information can be referenced afterwards.

[Figure: prototype database schema, listing each table and its fields]

EntrezGene: Entrez Gene ID, Locus name
PubMed: PMID, Title
GeneRIF: Entrez Gene ID, PMID, Heading, Description
PubMedMeSH Annotations: PMID, MeSH Term, MeSH Qualifier, Major Topic?
MeSH: Term, Tree_number
RelatedArticles: PMID, Related_PMID, Score
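The tables of the prototype schema could be set up along the following lines, shown here with Python's built-in `sqlite3` module. This is a sketch under assumptions: the column types, key choices and lower-case column names are guesses, the choice of SQLite is illustrative, and all inserted rows are placeholders.

```python
import sqlite3

# Table and field names follow the prototype schema; types and keys are assumed.
SCHEMA = """
CREATE TABLE EntrezGene (entrez_gene_id INTEGER PRIMARY KEY, locus_name TEXT);
CREATE TABLE PubMed (pmid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE GeneRIF (entrez_gene_id INTEGER, pmid INTEGER,
                      heading TEXT, description TEXT);
CREATE TABLE PubMedMeSH (pmid INTEGER, mesh_term TEXT,
                         mesh_qualifier TEXT, major_topic INTEGER);
CREATE TABLE MeSH (term TEXT, tree_number TEXT,
                   PRIMARY KEY (term, tree_number));
CREATE TABLE RelatedArticles (pmid INTEGER, related_pmid INTEGER, score REAL);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder rows, for illustration only.
conn.execute("INSERT INTO EntrezGene VALUES (1001, 'GENE_A')")
conn.execute("INSERT INTO PubMed VALUES (1, 'Placeholder title')")
conn.execute("INSERT INTO GeneRIF VALUES (1001, 1, 'placeholder', 'placeholder')")
conn.execute("INSERT INTO PubMedMeSH VALUES (1, 'Brain Diseases', NULL, 1)")
conn.execute("INSERT INTO MeSH VALUES ('Brain Diseases', 'C10.228.140')")

# Genes linked, through a GeneRIF article, to a term at or below C10.228.140.
rows = conn.execute("""
    SELECT DISTINCT g.locus_name, a.mesh_term
    FROM EntrezGene g
    JOIN GeneRIF r ON r.entrez_gene_id = g.entrez_gene_id
    JOIN PubMedMeSH a ON a.pmid = r.pmid
    JOIN MeSH m ON m.term = a.mesh_term
    WHERE m.tree_number = 'C10.228.140'
       OR m.tree_number LIKE 'C10.228.140.%'
""").fetchall()
print(rows)
```

Keeping the MeSH tree numbers in their own table lets the same query be restricted to any other disease branch by changing only the tree number.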

F. References

Adie, E., R. Adams, et al. (2005). "Speeding disease gene discovery by sequence based candidate prioritization." BMC Bioinformatics 6(1): 55.

Aerts, S., D. Lambrechts, et al. (2006). "Gene prioritization through genomic data fusion." Nat Biotech 24(5): 537-44.

Gaulton, K., K. Mohlke, et al. (2007). "A computational system to select candidate genes for complex human traits." Bioinformatics.

López-Bigas, N. and C. Ouzounis (2004). "Genome-wide identification of genes likely to be involved in human genetic disease." Nucleic Acids Research 32(10): 3108.

Odom, D., R. Dowell, et al. (2007). "Tissue-specific transcriptional regulation has diverged significantly between human and mouse." Nat Genet 39(6): 730-2.

Perez-Iratxeta, C., P. Bork, et al. (2007). "Update of the G2D tool for prioritization of gene candidates to inherited diseases." Nucleic Acids Research.

Perez-Iratxeta, C., P. Bork, et al. (2002). "Association of genes to genetically inherited diseases using data mining." Nat Genet 31(3): 316-9.

Perez-Iratxeta, C., M. Wjst, et al. (2005). "G2D: a tool for mining genes associated with disease." BMC Genetics 6(1): 45.

Tiffin, N., E. Adie, et al. (2006). "Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes." Nucleic Acids Research 34(10): 3067.

Tiffin, N., J. Kelso, et al. (2005). "Integration of text- and data-mining using ontologies successfully selects disease gene candidates." Nucleic Acids Research 33(5): 1544-52.

Turner, F., D. Clutterbuck, et al. (2003). "POCUS: mining genomic sequence annotation to predict disease genes." Genome Biology 4(11): 75.

Van Driel, M. A., K. Cuelenaere, et al. (2005). "GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases." Nucleic Acids Research 33(web server): 758.

Zhou, X., M.-C. Kao, et al. (2002). "Transitive functional annotation by shortest-path analysis of gene expression data." Proceedings of the National Academy of Sciences of the United States of America 99(20): 12783-8.