DISEASE GENE PRIORITIZATION
Presented by Qian Huang
What is a disease
• A disease is a condition of a living animal or plant, or of one of its parts, that impairs normal function and is typically manifested by distinguishing signs and symptoms
• The definition also covers the malfunction of individual cells or cell groups
• Many diseases should be defined at the cellular level
• Sickle cell disease was first documented in 1904
• In 1949, sickle cell disease became the first disease to be characterized at the molecular level
• With this, the first genetic disease was discovered
Genetic diseases
• A genetic disease is any disease that is caused by an abnormality in an individual's genome
• It is rare for one gene to be responsible for one function; an assembly of genes constitutes a functional module or a molecular pathway
• A molecular pathway leads to a specific end point in cellular functionality via a series of interactions between molecules in the cell
• Any change in the normal molecular interactions and pathways may lead to disease
• The specifics of a change determine the severity and the type of the resulting disease
Genetic diseases
• Inherited from the parents or caused by mutations
• There are a number of different types of genetic diseases
• Single gene disorder
 - Mendelian or monogenic inheritance
 - caused by changes or mutations that occur in the DNA sequence of a single gene
 - over 4000 human diseases are caused by single gene disorders
 - occur in about 1 out of every 200 births
 - dominant: only one mutated copy of the gene is necessary for a person to be affected; with one affected parent, each child has a 50% chance of being affected
 - recessive: two copies of the gene must be mutated for a person to be affected; when two unaffected people each carry one copy of the mutated gene, each child has a 25% chance of being affected
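The 50% and 25% figures for dominant and recessive inheritance can be reproduced by enumerating a Punnett square. The sketch below is only an illustration under simple Mendelian assumptions: each parent passes on one of two alleles with equal probability, and the affected parent in the dominant case is assumed to be heterozygous. The function names and allele letters are made up for this example.

```python
from fractions import Fraction
from itertools import product


def offspring_genotypes(parent1, parent2):
    """Return {genotype: probability} for one child.

    Each parent is a two-letter string of alleles, e.g. "Mn".
    Every allele pairing is equally likely (simple Mendelian assumption).
    """
    probs = {}
    for a, b in product(parent1, parent2):
        genotype = "".join(sorted((a, b)))  # "Mn" and "nM" are the same genotype
        probs[genotype] = probs.get(genotype, Fraction(0)) + Fraction(1, 4)
    return probs


def p_affected(parent1, parent2, mutant, dominant):
    """Probability that a child is affected, given the mutant allele letter."""
    needed = 1 if dominant else 2  # mutated copies required to be affected
    return sum(
        (p for g, p in offspring_genotypes(parent1, parent2).items()
         if g.count(mutant) >= needed),
        Fraction(0),
    )


# Dominant: one affected (heterozygous) parent "Mn", one unaffected parent "nn".
print(p_affected("Mn", "nn", mutant="M", dominant=True))   # 1/2
# Recessive: two unaffected carriers "Nm", mutant allele "m".
print(p_affected("Nm", "Nm", mutant="m", dominant=False))  # 1/4
```

If the affected parent in the dominant case carried two mutated copies instead, every child would be affected, which is why the heterozygous assumption is stated explicitly.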
Genetic diseases
• Chromosome abnormalities
 - chromosomes are distinct structures made up of DNA and protein
 - caused by abnormalities in chromosome number or structure
 - due to a problem with cell division
• Multifactorial gene disorder
 - caused by a combination of environmental factors and mutations in multiple genes
 - examples: heart disease, high blood pressure, cancer, diabetes, obesity
Genetic diseases
• Identifying the relationship between human genetic diseases and their causal genes is important for improving human medicine
• Revealing the genetic basis of human disease is a fundamental aim of human genetic studies
• The Human Genome Project started in 1990
• Genomic studies are rapidly accumulating large amounts of genomic data
• Many computational methods have been proposed to prioritize candidate causal genes by considering the relationship between the candidate genes for a given phenotype and known disease genes
What is Gene Prioritization
• Gene prioritization is the process of assigning a likelihood of gene involvement in generating a disease phenotype
• It narrows down and orders the set of genes to be tested experimentally
• It is based on various kinds of correlative evidence that associate each gene with the given disease and suggest possible causal links
• Evidence comes from high-throughput experimentation, including gene expression and function, pathway involvement, and mutation effects
Why use Gene Prioritization
• Proving a causal link between a gene and a disease experimentally is expensive and time-consuming
• Computational prioritization of candidate genes prior to experimental testing can drastically reduce the associated costs and improve the outcomes of targeted experimental studies
• High-throughput experimental techniques have contributed significantly to the identification of disease-associated genes and mutations, and have produced large amounts of data
• Gene prioritization is a computational method for dealing with this quantity of data and effectively translating the experimental data into legible disease-gene associations
Identification of disease-genes
• Disease results from changes in normal function
• Four causes of changes in pathway function:
 (1) changes in gene expression
 (2) changes in the structure of the gene product
 (3) introduction of new pathway members
 (4) environmental disruptions
• The task is to define the molecular pathways whose disrupted functionality is necessary and sufficient to cause the disease
• All members of the affected pathways can be construed as disease genes
• Identification of disease-genes is difficult
How to identify disease-genes
• Disease genes are most often identified using:
 (1) genome-wide association or linkage analysis studies
 (2) similarity or linkage to, and co-expression with, known disease genes
 (3) participation in known disease-associated pathways or compartments
• Methods draw on direct and indirect evidence
 o Direct: evidence coming from one's own experimental work and from the literature
 o Indirect: genes that are in any way related to already established disease-associated genes
Indirect evidence
Very broadly, gene-disease associations are inferred from five kinds of evidence:
(1) Functional Evidence: the suspect gene is a member of the same molecular pathways as other disease-genes
(2) Cross-species Evidence: the suspect gene has homologues implicated in generating similar phenotypes in other organisms
Indirect evidence
(3) Same-compartment Evidence: the suspect gene is active in disease-associated pathways (e.g. ion channels), cellular compartments (e.g. cell membrane), and tissues (e.g. liver)
(4) Mutation Evidence: the suspect gene is affected by functionally deleterious mutations in the genomes of diseased individuals
(5) Text Evidence: there is ample co-occurrence of gene and disease terms in scientific texts
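One simple way to picture how the five kinds of indirect evidence could feed a ranking is a weighted sum per gene. This is only a sketch: the weighted-sum rule, the equal default weights, the evidence values and the gene names (GENE_A, GENE_B, GENE_C) are all hypothetical and are not taken from any particular tool.

```python
EVIDENCE_TYPES = ("functional", "cross_species", "compartment", "mutation", "text")


def evidence_score(evidence, weights=None):
    """Weighted average of evidence strengths in [0, 1]; missing types count as 0.

    The weighting scheme is an illustrative assumption, not a published method.
    """
    if weights is None:
        weights = {name: 1.0 for name in EVIDENCE_TYPES}
    total_weight = sum(weights[name] for name in EVIDENCE_TYPES)
    weighted = sum(weights[name] * evidence.get(name, 0.0) for name in EVIDENCE_TYPES)
    return weighted / total_weight


def rank_genes(candidates, weights=None):
    """Return gene names ordered from strongest to weakest combined evidence."""
    return sorted(
        candidates,
        key=lambda gene: evidence_score(candidates[gene], weights),
        reverse=True,
    )


# Hypothetical candidates with made-up evidence values.
candidates = {
    "GENE_A": {"functional": 1.0, "mutation": 1.0, "text": 1.0},
    "GENE_B": {"text": 1.0},
    "GENE_C": {"functional": 1.0, "cross_species": 1.0,
               "compartment": 1.0, "mutation": 1.0},
}
print(rank_genes(candidates))  # ['GENE_C', 'GENE_A', 'GENE_B']
```

Passing a different `weights` dictionary changes how much each kind of evidence counts, which is one place where real tools differ from each other.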
Overview of gene prioritization data flow
Molecular Interactions
• Many gene prioritization tools use gene-gene (protein-protein) interaction and pathway information to prioritize candidate genes
• Genes responsible for similar diseases often participate in the same interaction networks
• MC4R is a receptor known to be associated with severe obesity
• The interactors of MC4R may therefore be predicted to be linked to obesity
• AgRP and POMC directly bind MC4R for different purposes within the MC4R pathway
• Mutations that negatively affect normal POMC production or processing have been shown to be obesity-associated
• AgRP has been linked to food intake abnormalities
MC4R-centered protein-protein interaction network
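The MC4R example amounts to "guilt by association" in an interaction network: genes that interact directly with a known disease gene become candidates. A minimal sketch follows, assuming the network is given as a list of undirected gene pairs. The two MC4R edges come from the slide; GENE_X and GENE_Y are hypothetical placeholders for unrelated genes.

```python
def direct_interactors(edges, seed_genes):
    """Return genes that interact directly with any seed gene, excluding the seeds."""
    seeds = set(seed_genes)
    hits = set()
    for a, b in edges:
        # Edges are undirected, so check both orientations.
        if a in seeds and b not in seeds:
            hits.add(b)
        if b in seeds and a not in seeds:
            hits.add(a)
    return hits


edges = [
    ("MC4R", "AgRP"),      # AgRP binds MC4R (from the slide)
    ("MC4R", "POMC"),      # POMC binds MC4R (from the slide)
    ("GENE_X", "GENE_Y"),  # hypothetical unrelated interaction
]
print(sorted(direct_interactors(edges, {"MC4R"})))  # ['AgRP', 'POMC']
```

Real tools typically go further than first neighbours, for example by weighting edges or following longer paths, but the first-neighbour rule is enough to reproduce the AgRP and POMC prediction described above.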
Regulatory and genetic linkage
• Co-regulation of genes has traditionally been thought to point to shared molecular pathways and similar diseases
• Co-expressed genes often cluster together in different species, leading to genetic linkage
• Genes co-expressed with or genetically linked to other disease genes are also likely to be disease-associated
• However, these relationships also pose a problem
Problem with Regulatory and genetic linkage
 o A given disease-associated gene may be co-regulated with or linked to another disease-associated gene
 o The two diseases are not identical
 o Because of genetic linkage, it is difficult to distinguish the mutations that actually cause a disease from those that merely co-occur with it
Similar sequence/structure/function
• Prioritization tools often use functional similarity as an input feature
• Predictors relying on functional similarity to determine disease association will link two genes that share the same function
• Functionally similar genes are likely to produce similar disease phenotypes, and sequence/structure similarities are indicators of similar disease involvement
• Disease genes are often associated with specific gene and protein features
 o higher exon number
 o longer gene length
Cross-species Evidence
• Cross-species comparisons of orthologues and their associated phenotypes
• Finding related phenotypes across species suggests orthologous human candidate genes
• MC4R is known to be associated with severe obesity
• Polar bears have a V95I mutation in MC4R, linked to their need to increase body fat to adapt to their environment
• The same mutation may have a similar (increased body fat) effect in humans
Cross-species Evidence
• Correlation of gene co-expression across species is also useful for gene prioritization
• Genes that are part of the same functional module are generally co-expressed
• Functionally unrelated genes can also be co-expressed
• Comparing genes co-expressed in human and in other organisms can be used to infer disease-genes
• A cluster of functionally unrelated genes co-expressed in human and mouse contained the disease-gene KCNIP4
• An initial list of 1,762 genes mapped to 850 OMIM (Online Mendelian Inheritance in Man) phenotypes was narrowed down to twenty times fewer possible disease-causing genes
Compartment Evidence
• Changes in gene expression in disease-affected compartments and tissues are associated with many complex diseases
• Suspect genes are predicted from their activity in disease-associated pathways, compartments and tissues
• For example, multiple storage diseases are all caused by impairment of the degradation pathways of intracellular transport
Mutant Evidence
• Every genetic disease is associated with some sort of mutation that alters normal functionality
• Selection of candidate genes for further analysis is often based on mutations in diseased individuals
• Not all observed mutations are associated with deleterious effects:
 (1) some have no effect at all (silent mutations)
 (2) some are deleterious with respect to normal function
 (3) some are weakly beneficial
• Strongly deleterious mutations are relatively rare because they are rapidly removed by selection
• A candidate gene carrying a deleterious mutation is more likely to be disease-associated than a gene with other mutations or no mutation at all
Mutant Evidence
• Structural variation (SV): insertions and deletions, inversions, translocations, etc.
• Nucleotide polymorphisms: SNPs (single nucleotide polymorphisms) and MNPs (multi-nucleotide polymorphisms)
• 90% of human variation exists in the form of nucleotide polymorphisms
Structural variation
• Structural variation (SV) is the least studied of all types of mutations
• Less than 10% of human genetic variation is in the form of genome structural variants
• Because each structural variant is large, the total number of base pairs affected by SVs may actually be comparable to the number of base pairs affected by the much more common SNPs
• Gross changes to genome sequence are very likely to be disease-associated
Structural variation
• High-throughput detection of structural variants is difficult
• Only 180 thousand structural variants are reported in DGV (Database of Genomic Variants), one of the most complete collections of such mutations
• It is not currently known what proportion of genetic disease is caused by SVs
• When a disease is caused by a change to a stretch of sequence, all of the genes found in that region of the genome are, by default, associated with the disease, but none of them can be considered primarily causal
• For diseases that are associated with SVs, prioritizing disease-causing genes therefore amounts only to finding the genes that are directly affected by the mutation
Nucleotide polymorphisms
• A single human genome is expected to contain roughly 10–15 million SNPs
• 93% of all human genes contain at least one SNP
• MNPs are rare compared to SNPs
• There are nearly 43 million validated human SNPs
 - coding SNPs
 - non-coding SNPs
coding SNPs and non-coding SNPs
(1) coding SNPs
 - of the 17.5 million SNPs that have been experimentally mapped to functionally distinct regions of the genome, only 0.4M are in coding regions, while the rest are in non-coding regions
 - coding SNPs are over-represented in disease associations, e.g. OMIM contains 2430 non-coding SNPs (0.0001% of all) and 5327 coding ones (0.01% of all)
 - synonymous (no effect on protein sequence)
 - non-synonymous (single amino acid substitution)
(2) non-coding SNPs
 - more prevalent than coding SNPs because the majority of the genome is non-coding
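The over-representation of coding SNPs can be checked with a little arithmetic on the counts quoted above. The slide does not say which totals its percentages are taken over, so the sketch below assumes the denominators are the 0.4M coding and the remaining 17.1M non-coding mapped SNPs. Under that assumption the OMIM proportions come out at roughly 0.013 and 0.00014, close to the quoted 0.01 and 0.0001, and the coding enrichment is about two orders of magnitude.

```python
# Counts quoted on the slide.
mapped_total = 17_500_000   # SNPs mapped to functionally distinct regions
coding_total = 400_000      # of which in coding regions
omim_coding = 5327          # coding SNPs in OMIM
omim_noncoding = 2430       # non-coding SNPs in OMIM

# Assumption: every mapped SNP that is not coding is non-coding.
noncoding_total = mapped_total - coding_total

coding_fraction = omim_coding / coding_total
noncoding_fraction = omim_noncoding / noncoding_total
enrichment = coding_fraction / noncoding_fraction

print(f"coding: {coding_fraction:.4f}")         # coding: 0.0133
print(f"non-coding: {noncoding_fraction:.6f}")  # non-coding: 0.000142
print(f"enrichment: about {enrichment:.0f}-fold")
```

With a different choice of denominators the individual proportions would change, but as long as non-coding SNPs greatly outnumber coding ones the direction of the enrichment stays the same.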
non-synonymous SNPs
• More studied
• Two types:
 - Missense: the change results in a different amino acid
 - Nonsense: the change produces a premature stop codon
• Nonsense mutations result in early termination of the protein and are very often associated with disease
• Missense SNPs alter the protein sequence without destroying it, and may or may not be disease-associated
• Most methods estimate that only 25–30% of non-synonymous SNPs (nsSNPs) negatively affect protein function
Nucleotide polymorphisms
• Identifying and annotating the functional effects of SNPs and MNPs is important in gene prioritization
• Genes selected for further disease-association studies are more likely to contain a deleterious mutation
• A number of methods have been created for identifying functionally deleterious mutations in regulatory regions
• Coding synonymous SNPs have recently been shown to have the same chance of being involved in a disease as non-coding SNPs, for reasons such as codon usage bias
• Few computational methods are able to predict their functional effects
Text Evidence
• Huge amounts of data could potentially improve the performance of any gene prioritization method
• Specialized tools can be used to prioritize disease-associated genes
• Researchers make their data computationally available through databases
• Knowledge obtained through reading and manual curation is also deposited into databases
Text Evidence
• Much knowledge is hidden in plain sight in the natural language text of scientific publications
• A casual search in PubMed for the term breast cancer generates over two hundred thousand matches; limiting the field to the genetics of breast cancer reduces this to fifty thousand
• Scientific text mining tools allow for intelligent identification of possible gene-gene and disease-gene correlations
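The simplest form of text evidence is counting how often a gene name and a disease term appear in the same sentence. The sketch below shows that idea only. Real text mining tools also handle synonyms, abbreviations and negation, which this does not. The function name and the example sentences are invented for illustration and are not real abstracts.

```python
import re
from collections import Counter


def cooccurrence_counts(texts, genes, disease):
    """Count sentences in which each gene name appears together with the disease term."""
    counts = Counter()
    disease_pattern = re.compile(re.escape(disease), re.IGNORECASE)
    for text in texts:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if not disease_pattern.search(sentence):
                continue
            for gene in genes:
                # Whole-word, case-sensitive match on the gene symbol.
                if re.search(rf"\b{re.escape(gene)}\b", sentence):
                    counts[gene] += 1
    return counts


# Invented toy sentences, used only to exercise the function.
texts = [
    "MC4R variants were examined in severe obesity. POMC was also measured.",
    "Obesity risk was discussed for MC4R and POMC.",
    "AgRP was examined in a feeding study.",
]
counts = cooccurrence_counts(texts, ["MC4R", "POMC", "AgRP"], "obesity")
print(counts["MC4R"], counts["POMC"], counts["AgRP"])  # 2 1 0
```

Counting per sentence rather than per document is a design choice: it reduces spurious hits from long texts that mention many genes, at the cost of missing associations that span sentences.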
The Inputs and Outputs
• The functionality of prioritization methods is defined by previously known information about the disease and by the candidate search space
• Disease information: disease-associated genes, affected tissues and pathways, and relevant keywords
• Candidate search space:
 - automatically selected by the tool (the entire genome)
 - submitted by the user (suspect genes)
• Providing a list of very broad keywords may reduce the specificity of the method, while an incorrect candidate search space automatically decreases sensitivity
• The output is a ranked/ordered list of genes produced from the inputs
• Prioritization accuracy depends on the accuracy and specificity of the inputs
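The effect of the candidate search space on sensitivity can be made concrete by scoring the top of a ranked list against a set of true disease genes. This is a sketch with hypothetical gene names (G1 to G8) and a hypothetical cutoff. It assumes the true disease genes are known, which in practice is only the case for benchmark data.

```python
def sensitivity_specificity(ranked, true_genes, universe, k):
    """Treat the top-k ranked genes as predictions and score them.

    ranked     -- candidate genes, best first (the tool's output)
    true_genes -- genes truly associated with the disease
    universe   -- all genes under consideration
    """
    predicted = set(ranked[:k])
    positives = set(true_genes)
    negatives = set(universe) - positives
    true_pos = len(predicted & positives)
    false_pos = len(predicted & negatives)
    sensitivity = true_pos / len(positives)
    specificity = (len(negatives) - false_pos) / len(negatives)
    return sensitivity, specificity


universe = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"]
true_genes = {"G1", "G2"}

# Search space contains both true genes.
full = sensitivity_specificity(["G1", "G3", "G2", "G4"], true_genes, universe, k=3)
# Search space wrongly omits G2, so it can never be recovered.
missing = sensitivity_specificity(["G1", "G3", "G4", "G5"], true_genes, universe, k=3)
print(full)     # sensitivity 1.0, specificity 5/6
print(missing)  # sensitivity 0.5, specificity 4/6
```

In the second call the true gene G2 was never in the search space, so no ranking method could have placed it in the top k. This is the sense in which an incorrect search space automatically lowers sensitivity.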
The Processing
• Gene prioritization methods use different algorithms to process all the data they extract:
 o mathematical/statistical models/methods
 o fuzzy logic
 o artificial learning devices
 o some methods use combinations of the above
• No one methodology is better than the others for all data inputs
• Refer to the relevant tool publications and method-specific literature for details on the computational methods used in the various approaches
Figure 5. Predicting gene-disease involvement using artificial neural networks (ANNs).
Bromberg Y (2013) Chapter 15: Disease Gene Prioritization. PLoS Comput Biol 9(4): e1002902. doi:10.1371/journal.pcbi.1002902
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002902
Gene prioritization tools
• Many different gene prioritization tools have been developed
• Endeavour:
 - a web resource for the prioritization of candidate genes
 - infers several models (based on various genomic data sources)
 - applies each model to the candidate genes to rank those candidates against the profile of the known genes
 - merges the several rankings into a global ranking of the candidate genes
Gene prioritization tools
• G2D (genes to diseases):
 - a web resource for prioritizing candidate genes
 - uses three algorithms based on different prioritization strategies
 - input: (1) the genomic region where the user is looking for the disease-causing mutation, (2) an additional piece of information depending on the algorithm used
 - output: in every case, an ordered list of candidate genes in the region of interest
Summary
• Gene prioritization methods are developed to link genes to diseases by extracting and combining various kinds of information
• They rely on experimental work, such as disease gene linkage analysis and genome-wide studies, to establish the search space of candidate genes
• They use mathematical and computational models of disease to filter the original set of genes based on gene and protein sequence, structure, function, interaction and expression information
• They effectively translate experimental data into legible disease-gene associations