INBIOMEDvision workshop at MIE 2011. Victoria López

2nd Consortium Meeting, Barcelona 16th May, 2011

Victoria López Alonso, PhD
Medical Bioinformatics Area
Instituto de Salud Carlos III

Spain


Bioinformatics challenges in a personalized medicine pipeline

Workshop INBIOMEDvision, MIE 2011


Bridging gaps between Bioinformatics and MI

BMI deals with the integrative management and synergic exploitation of the wide and inter-related scope of information that is generated and needed in healthcare settings, biomedical research institutions and health-related industry.


Overview:
Personalized medicine in current practice
1- Processing large-scale genomic data
2- Interpretation of functional effect of genomic variation
3- Integration of systems data
4- Translation into medical practice

Bioinformatics challenges for Personalized medicine


Personalized medicine in current practice

Translational bioinformatics uses computational tools to analyze large biological databases and to understand disease mechanisms, not only through genetics and proteomics but also by associating them with clinical data.


Advances of molecular science

• Human Genome Project in 2003. Finishing the euchromatic sequence of the human genome. Nature 2004; 431(7011): 931-945.

• Phase I HapMap project in 2005 (Phase II and Phase III). A haplotype map of the human genome. Nature 2005; 437(7063): 1299-1320.

• Encyclopedia of DNA Elements (ENCODE) project in 2007. Identification and analysis of functional elements in 1% of the human genome. Nature 2007; 447(7146): 799-816.

• 1000 Genomes Project in 2008. DNA sequences. A plan to capture human diversity in 1000 genomes. Science 2008; 319(5863): 395.

$1000 Genome in …2013 ??



Chemotherapy medications trastuzumab and imatinib (Gambacorti-Passerini, 2008; Hudis, 2007)

A targeted pharmacogenetic dosing algorithm is used for warfarin (International Warfarin Pharmacogenetics Consortium et al., 2009)

Incidence of adverse events for the drugs abacavir, carbamazepine and clozapine (Dettling et al., 2007; Ferrell and McLeod, 2008, 2002).

The inclusion of genetics in EHRs will provide risk assessment. Clinical assessment incorporating a personal genome. Ashley et al. Lancet (2010)


Bentley D. “Genomes for Medicine”. (2004). Nature Insight 429, p440-446

Today, patients' genetics are consulted only for a few diagnoses and treatments, and only in certain medical centers (cystic fibrosis, breast cancer).

With easy access to a well-annotated human genome, an individual could acquire a genetic health profile, including risk and resistance factors, that could be used to guide medical decisions.



1- Processing large-scale genomic data
2- Interpretation of functional effect of genomic variation
3- Integration of systems data
4- Translation into medical practice

Bioinformatics challenges for Personalized medicine

Different informatics challenges should be addressed to create the tools to tailor medical care to each individual genome and also to realize the potential of personalized medicine


SNPs (Single Nucleotide Polymorphisms) are key enablers in realizing the concept of personalized medicine.

Sequencing technologies are becoming accessible
Whole genome < 2 weeks
1 error per 100 kb → 30,000 erroneous variant calls

The error rate of these technologies is a source of significant challenges in applications, including discovering novel variants
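The figure of 30,000 erroneous calls follows from simple arithmetic. The sketch below reproduces it under one assumption that is not stated on the slide: a genome size of roughly 3 billion bases. The function name is illustrative only.

```python
# Back-of-the-envelope estimate of erroneous variant calls per genome.
GENOME_SIZE_BP = 3_000_000_000  # assumption: ~3 Gb human genome (not on the slide)
BASES_PER_ERROR = 100_000       # 1 error per 100 kb, as quoted on the slide

def expected_erroneous_calls(genome_size_bp: int, bases_per_error: int) -> float:
    """Expected number of erroneous calls, assuming a uniform error rate."""
    return genome_size_bp / bases_per_error

errors = expected_erroneous_calls(GENOME_SIZE_BP, BASES_PER_ERROR)
print(f"{errors:,.0f} expected erroneous variant calls")
```

Even a small per-base error rate therefore yields tens of thousands of false calls at genome scale, which is why novel variants need independent verification.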

1-Processing large-scale genomic data

SNP: frequency in the human population is higher than 1%


100,000 to 300,000 previously undiscovered SNPs
Variant discovery: a “needle in a haystack”

Novel variants must be verified because of the false positive rate.

In addition, there are other important classes of variation for clinical applications:

short insertion–deletion variants (indels), copy number variants (CNVs) and structural variants (SVs)

1-Processing large-scale genomic data

New algorithms to detect these variations from sequencing data


1- Processing large-scale genomic data

High quality sequence reads must be placed into their genomic context to identify variants.

The challenge is to develop new algorithms that make de novo assembly computationally feasible. De novo assembly is slow and complicated by repetitive elements.

Sequences are mapped to a genomic reference sequence: BLAST has traditionally been used, but its execution speed depends on the genome size.


1- Processing large-scale genomic data

New mapping and alignment algorithms
• BLAT: indexed version of the genome (Kent, 2002).
• Burrows-Wheeler Aligner (BWA) (Li and Homer, 2010).
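To illustrate why an indexed reference speeds up mapping, here is a minimal sketch of exact read lookup with a Burrows-Wheeler transform and backward search. It is a toy under stated assumptions, not the BWA implementation: the suffix array is built naively, only exact matches are found, and the names build_index and find_exact are hypothetical.

```python
from collections import Counter

def build_index(reference: str):
    """Build a toy FM-index (suffix array, C table, Occ table) for a reference."""
    text = reference + "$"  # sentinel; sorts before A, C, G, T in ASCII
    # Naive suffix array construction: fine for a toy, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    # c_table[c] = number of characters in the text that sort before c.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] for ch in counts}
    for symbol in bwt:
        for ch in occ:
            occ[ch].append(occ[ch][-1] + (symbol == ch))
    return sa, c_table, occ

def find_exact(read: str, index) -> list:
    """Return sorted 0-based reference positions where the read matches exactly."""
    sa, c_table, occ = index
    low, high = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(read):  # backward search: extend the match to the left
        if ch not in c_table:
            return []
        low = c_table[ch] + occ[ch][low]
        high = c_table[ch] + occ[ch][high]
        if low >= high:
            return []
    return sorted(sa[low:high])

index = build_index("ACGTACGTTAGC")  # made-up reference for illustration
print(find_exact("ACGT", index))
print(find_exact("TAG", index))
```

Once the index is built, each lookup takes a number of steps proportional to the read length rather than the genome size, which is the property that makes this family of aligners attractive for large references.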

• Ideally performed in a cluster or by using cloud computing
• The program must allow for mismatches without resulting in false alignments
• Improved quality control metrics: ratios of base transitions, Mendelian inheritance errors (MIE), relative quality scores…


2- Interpretation of functional effect

After genomic data have been processed, the functional effect and the impact of the genetic variation must be analyzed. Genome-wide association studies (GWAS) have been used to assess the statistical associations of SNPs with many important common diseases. GWAS provide new insights, but only a limited number of variants have been characterized, and understanding the functional relationship between variants and phenotypes remains a challenge.
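As an illustration of the kind of statistical association a GWAS assesses for each SNP, the sketch below runs an allelic chi-square test on a 2x2 table of allele counts in cases and controls. The counts are made up for the example and the function name is hypothetical. Real studies also correct for multiple testing and population structure, which this sketch ignores.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts: the alternative allele is more frequent in cases.
chi2, p = allelic_chi_square(case_alt=60, case_ref=40, control_alt=40, control_ref=60)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Because the same test is repeated for hundreds of thousands of SNPs, a p-value that looks convincing for one SNP is far from sufficient genome-wide.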

https://www.wtccc.org.uk



Important issues for predicting the effect of SNPs are data management, retrieval and quality control.

SNP databases:
• The dbSNP database (20 million validated SNPs)
• The Human Gene Mutation Database (HGMD) (SNPs associated with diseases)
• SwissVar
• Online Mendelian Inheritance in Man (OMIM) database
• PharmGKB database
• Catalogue of Somatic Mutations in Cancer (COSMIC)

[Figure: number of known SNPs. Fernald et al. 2011]


2- Interpretation of functional effect

Computational methods to predict mSNPs:

•Empirical rules (Ng and Henikoff, 2003; Ramensky et al., 2002),

•Hidden Markov Models (HMMs) (Thomas and Kejariwal, 2004),

•Neural Networks (Bromberg et al., 2008; Ferrer-Costa et al., 2005),

•Decision Trees (Dobson et al., 2006; Krishnan and Westhead, 2003),

•Random Forests (Li et al., 2009; Wainreb et al., 2010)

•Support Vector Machines (Calabrese et al., 2009).

The prediction algorithms' input features include:
• amino acid sequence
• protein structure
• evolutionary information


New algorithms that combine knowledge-based information with evolutionary information are being developed to predict the effect of SNPs:

• PANTHER uses a library of protein family HMMs. http://www.pantherdb.org/
• PolyPhen uses different sequence-based features. http://genetics.bwh.harvard.edu/pph
• MutPred evaluates the probabilities of gain or loss of structure and function upon mutation using a random forest. http://mutdb.org/mutpred
• SIFT uses a multiple sequence alignment between homologous proteins. http://sift.jcvi.org
• SNAP Sequence http://rostlab.org/services/snap
• SNPEffect http://snpeffect.vib.be
• SNPs3D Structure-based SVM predictor http://www.snps3d.org
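The idea of using evolutionary information from an alignment of homologous proteins can be sketched in a few lines. This is a deliberately simplified illustration, not the scoring used by SIFT or any other tool listed above: it only asks how often the substituted amino acid is observed in one alignment column. The column, the 0.05 threshold and the function names are all assumptions made for the example.

```python
from collections import Counter

def substitution_frequency(alignment_column: str, substitute: str) -> float:
    """Fraction of homologs (ignoring gaps) carrying `substitute` at this position."""
    residues = [aa for aa in alignment_column if aa != "-"]
    if not residues:
        return 0.0
    return Counter(residues)[substitute] / len(residues)

def predict_effect(alignment_column: str, substitute: str, threshold: float = 0.05) -> str:
    """Toy rule: a substitution rarely seen among homologs is flagged."""
    frequency = substitution_frequency(alignment_column, substitute)
    return "tolerated" if frequency >= threshold else "possibly damaging"

# Made-up column: one position across ten homologous sequences.
column = "LLLLLLLLIL"
print(predict_effect(column, "I"))  # isoleucine is seen in one homolog
print(predict_effect(column, "P"))  # proline is never seen at this position
```

The predictors listed above combine this kind of conservation signal with structural and sequence features and with trained statistical models.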




2- Interpretation of functional effect

Experimental tests are required to validate genetic predictions. There is a need for fast and accurate methods for gene prioritization.

Eleftherohorinou et al., 2010

Currently the most effective strategy prioritizes genes that are linked to the biological process of interest. The input data for gene prioritization are functional annotations, protein–protein interactions, biological pathways and literature.
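A minimal sketch of this strategy: rank candidate genes by how much their annotations, interactions and pathways overlap with genes already linked to the process of interest. The scoring (mean Jaccard similarity per evidence source), the gene names and the feature sets are all hypothetical and chosen only to show the shape of the computation.

```python
def prioritize(candidates, seed_genes, evidence):
    """Rank candidates by mean Jaccard similarity to the seed genes.

    `evidence` maps a source name (e.g. "pathways") to {gene: set of features}.
    Returns a list of (gene, score) pairs, best first.
    """
    scores = {}
    for gene in candidates:
        total = 0.0
        for features in evidence.values():
            # Pool the features of all seed genes for this evidence source.
            seed_features = set().union(*(features.get(s, set()) for s in seed_genes))
            gene_features = features.get(gene, set())
            union = gene_features | seed_features
            if union:
                total += len(gene_features & seed_features) / len(union)
        scores[gene] = total / len(evidence) if evidence else 0.0
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical data: one seed gene and two candidates.
evidence = {
    "annotation": {"SEED1": {"ion transport"}, "GENE_A": {"ion transport"}, "GENE_B": {"DNA repair"}},
    "interactions": {"SEED1": {"P1", "P2"}, "GENE_A": {"P1"}, "GENE_B": {"P9"}},
}
for gene, score in prioritize(["GENE_A", "GENE_B"], ["SEED1"], evidence):
    print(gene, round(score, 2))
```

The ranking only tells experimentalists where to look first; it does not replace the validation step.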



• SUSPECT: sequence features, gene expression data, functional terms…
• ToppGene: mouse phenotype data with human gene annotations and literature
• MedSim: human disease genes with mouse genes
• ENDEAVOUR: genes involved in a known biological process
• G2D and PolySearch: data mining on biological databases
• MimMiner: text mining comparing the human phenome and disease phenotypes
• PhenoPred: uses protein sequence and function
• GeneMANIA: uses functional assays

The Gene Prioritization Portal provides comprehensive descriptions of available predictors:

http://homes.esat.kuleuven.be/~bioiuser/gpp/index.php



Last year, the first edition of the Critical Assessment of Genome Interpretation (CAGI) was organized to assess the available methods for predicting phenotypic impact of genomic variation and to stimulate future research.

http://genomeinterpretation.org/


3-Integration of systems data

There is concern that pharmacogenomics GWAS themselves are susceptible to many limitations:

Insufficient sample size, selection biases for genetic variants and environmental interactions may affect the outcome measures.
Multiple gene–gene interactions may underlie unexplained variation.

HapMap Project, 2004


3- Integration of systems data

Model selection methods have been successful in disease and trait GWAS, using selection techniques to choose multifactorial models that balance the false positive rate, statistical power and computational requirements of the search.

Dimensionality reduction methods
• Principal Components Analysis
• Information Gain
• Multifactor Dimensionality Reduction (e.g. hypertension and familial amyloid polyneuropathy type I)
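Of the methods listed, information gain is the simplest to show in code. The sketch below scores a single SNP by how much knowing its genotype reduces the uncertainty (entropy) about the phenotype. The genotype coding and the tiny data set are made up for illustration. In practice the score is computed for every SNP and only the highest-ranked ones are kept for model building.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(genotypes, phenotypes) -> float:
    """Reduction in phenotype entropy obtained by splitting on the genotype."""
    n = len(phenotypes)
    conditional = 0.0
    for genotype in set(genotypes):
        subset = [p for g, p in zip(genotypes, phenotypes) if g == genotype]
        conditional += len(subset) / n * entropy(subset)
    return entropy(phenotypes) - conditional

# Made-up data: genotypes coded as the number of minor alleles.
phenotypes = ["case", "case", "control", "control"]
informative_snp = [1, 1, 0, 0]    # genotype separates cases from controls
uninformative_snp = [0, 1, 0, 1]  # genotype is unrelated to the phenotype
print(information_gain(informative_snp, phenotypes))
print(information_gain(uninformative_snp, phenotypes))
```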

Ritchie and Motsinger, 2010


3- Integration of systems data

Naylor and Chen, 2010

No external knowledge source informs about the biology behind the interactions.

Systems biology and network approaches address the problem of complexity by integrating molecular data at multiple levels of biology, including genomes, transcriptomes, metabolomes, proteomes and functional and regulatory networks.



The simple “one SNP, one phenotype” approach is insufficient. Most medically relevant phenotypes are thought to be the result of gene–gene and gene–environment interactions.

Adeyemo et al., 2010



Limdi and Veenstra, 2008

Drug response often depends on multiple pharmacokinetic and pharmacodynamic interactions.

Some success: studies of warfarin have linked the majority of variation in response to two genes, CYP2C9 and VKORC1. Improved dosing algorithm.


Goh et al., 2007


• Disease–Gene Networks
• Chemical structures, Diseases and Protein sequences
• Epigenetic data and Drug Phenotypes
• Pathways and Gene sets

• Gene Set Enrichment Analysis (GSEA)
• SNP Ratio Test
• The Prioritizing Risk Pathways method

Assumptions must also be examined carefully!


Combining disparate data sources can result in novel associations
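To make the pathway and gene-set idea concrete, the sketch below tests whether a list of associated genes contains more members of a pathway than expected by chance. Note the swap: this is a plain hypergeometric over-representation test, which is simpler than the rank-based statistic used by GSEA itself. All numbers in the example are made up.

```python
from math import comb

def enrichment_p_value(population: int, pathway_size: int, hits: int, hits_in_pathway: int) -> float:
    """P(X >= hits_in_pathway) when `hits` genes are drawn at random from `population`.

    X is the number of drawn genes that belong to a pathway of `pathway_size` genes.
    One assumption to examine: genes are treated as independent and equally likely.
    """
    upper = min(pathway_size, hits)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, hits - k)
        for k in range(hits_in_pathway, upper + 1)
    )
    return tail / comb(population, hits)

# Made-up example: 20,000 genes, a 100-gene pathway, 200 associated genes, 8 in the pathway.
p = enrichment_p_value(population=20_000, pathway_size=100, hits=200, hits_in_pathway=8)
print(f"p = {p:.2e}")
```

The independence assumption built into this test is exactly the kind of assumption that should be examined, since genes in the same region or pathway are often correlated.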


4- Translation into medical practice

Much of this research has yet to be translated to the clinic for improved patient care. One of the areas where bioinformatics can have the greatest clinical impact is pharmacogenomics, by improving drug prescription and dosing. Pharmacogenomic prescription and dosing algorithms need to be accessible to physicians.

Martin-Sanchez et al. 2006

Warfarin dosing could save up to 60% of the cost and reduce possible adverse events


Medical practice needs to be updated to include routine pharmacogenetic testing, education and training of physicians in personalized medicine, and further clinical trials to prove the efficacy of predictions

Bioinformatics also translates discoveries to the clinic by disseminating discoveries through curated, searchable databases


4-Translation into medical practice

• The database of Genotypes and Phenotypes (dbGaP): www.ncbi.nlm.nih.gov/gap
• The Pharmacogenomics Knowledge Base (PharmGKB): http://www.pharmgkb.org/
• Pharmacogenetics-Cell line database (PACdb): http://pacdb.org/
• The Adverse Event Reporting System (AERS): www.fda.gov/Drugs/


Biologically and medically focused text mining algorithms can speed the collection of this structured data, such as methods that use sentence syntax and natural language processing to derive drug–gene and gene–gene interactions from scientific literature.
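A very reduced sketch of literature mining for drug–gene relations follows. It is an assumption-laden simplification: instead of parsing sentence syntax, it only counts drugs and genes that are mentioned in the same sentence. The dictionaries and the example text are illustrative, and the function name is hypothetical.

```python
import re
from collections import Counter
from itertools import product

def cooccurrences(text: str, drugs, genes) -> Counter:
    """Count (drug, gene) pairs mentioned together in the same sentence."""
    pairs = Counter()
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence.lower()))
        found_drugs = [d for d in drugs if d.lower() in tokens]
        found_genes = [g for g in genes if g.lower() in tokens]
        pairs.update(product(found_drugs, found_genes))
    return pairs

# Illustrative text, not a quotation from the literature.
text = (
    "Variation in warfarin response has been linked to CYP2C9 and VKORC1. "
    "Abacavir is discussed in a separate section."
)
result = cooccurrences(text, drugs=["warfarin", "abacavir"], genes=["CYP2C9", "VKORC1"])
for (drug, gene), count in sorted(result.items()):
    print(drug, gene, count)
```

Co-occurrence alone cannot tell whether a sentence asserts or denies a relation, which is why the methods mentioned above add syntactic analysis on top of it.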

Opportunities for bioinformatics to integrate with the electronic medical record (EMR)


4- Translation into medical practice

• BioBank system at Vanderbilt: www.mc.vanderbilt.edu/
• RTI International with NHGRI: www.phenx.org/


http://biotic.isciii.es/


Instituto de Salud Carlos III
Medical Bioinformatics Area

Thanks!

http://www.inbiomedvision.eu/
