Post on 27-Jan-2015

Page 1: Introduction to Bioinformatics for UVA Cell Bio 8401

Introduction to Bioinformatics

Stephen Turner, Ph.D.
Bioinformatics Core Director
[email protected]

Slides at bit.ly/intro-bioinfo

Page 2: Introduction to Bioinformatics for UVA Cell Bio 8401

Contact

Web: bioinformatics.virginia.edu

E-mail: [email protected]

Blog: GettingGeneticsDone.com

Twitter: @genetics_blog

Page 3: Introduction to Bioinformatics for UVA Cell Bio 8401

Bioinformatics Origins:

Rooted in sequence analysis.

Driven by the need to:
● Collect
● Annotate
● Analyze

Page 4: Introduction to Bioinformatics for UVA Cell Bio 8401

Margaret Dayhoff (1925-1983)

● Collected all known protein structures & sequences
● Published Atlas in 1965
● Pioneered algorithm development for:
    ○ Comparing protein sequences
    ○ Deriving evolutionary history from alignments

“In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.”

Page 5: Introduction to Bioinformatics for UVA Cell Bio 8401

IBM 7090

Page 6: Introduction to Bioinformatics for UVA Cell Bio 8401

“There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.”

M. Dayhoff, February 27, 1967

Page 7: Introduction to Bioinformatics for UVA Cell Bio 8401

modified from @drewconway

Page 8: Introduction to Bioinformatics for UVA Cell Bio 8401

[Figure: timeline, 1960 to 2010. Biology: Dayhoff Atlas, Sanger sequencing, GenBank, EBI-EMBL, next-gen sequencing. Computing: ARPAnet, Internet invented, WWW invented.]

Page 9: Introduction to Bioinformatics for UVA Cell Bio 8401
Page 10: Introduction to Bioinformatics for UVA Cell Bio 8401

Definition

From Wikipedia: Bioinformatics is a branch of biological science which deals with the study of methods for storing, retrieving and analyzing biological data, such as nucleic acid (DNA/RNA) and protein sequence, structure, function, pathways and genetic interactions. It generates new knowledge that is useful in such fields as drug design and development of new software tools to create that knowledge. Bioinformatics also deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, structural biology, software engineering, data mining, image processing, modeling and simulation, discrete mathematics, control and system theory, circuit theory, and statistics.

Our definition: using computer science and statistics to answer biological questions.

Page 11: Introduction to Bioinformatics for UVA Cell Bio 8401

Subdisciplines

● Sequence alignment (DNA, RNA, Protein)
● Genome annotation
● Evolutionary biology / comparative genomics
● Analysis of gene expression
● Analysis of gene regulation
● Genotype-phenotype association
● Mutation analysis
● Structural biology
● Biomarker identification
● Pathway analysis / "systems biology"
● Literature analysis / text-mining

Page 12: Introduction to Bioinformatics for UVA Cell Bio 8401

Central Dogma

DNA → RNA → Protein

Post-translational modification
Prions
Reverse transcription
Methylation
RNA silencing

Page 13: Introduction to Bioinformatics for UVA Cell Bio 8401

DNA provides assembly instructions for proteins

Protein folding determines molecular function

Networks of interacting proteins determine tissue/organ function

Page 14: Introduction to Bioinformatics for UVA Cell Bio 8401

DNA provides assembly instructions for proteins

Protein folding determines molecular function

Networks of interacting proteins determine tissue/organ function

DNA variant analysis
Gene expression analysis
Genome annotation
Epigenetics
Pathway analysis
Systems biology
Biomarker ID'n
miRNA analysis
Quantitative MS
Proteomics

Page 15: Introduction to Bioinformatics for UVA Cell Bio 8401

Subdisciplines

● Sequence alignment (DNA, RNA, Protein)
● Genome annotation
● Evolutionary biology / comparative genomics
● Analysis of gene expression
● Analysis of gene regulation
● Genotype-phenotype association
● Mutation analysis
● Structural biology
● Biomarker identification
● Pathway analysis / "systems biology"
● Literature analysis / text-mining

Page 16: Introduction to Bioinformatics for UVA Cell Bio 8401

Outbreak: fever, characteristic skin lesions.

Culture, isolate DNA, sequence (Sanger):

GTGAGTAATAATAATTCAAAACTGGAATTTGTACCTAATATACAGCTTAAAGAAGACTTAGGAGCTTTTAGCTATAAAGTCCAACTTTCT

CCTGTAGAAAAAGGTATGGCTCATATCCTTGGTAACTCTATTAGAAGGGTTTTATTATCTTCACTATCAGGTGCATCTATAATTAAAGTA

AACATCGCTAATGTACTACATGAGTATTCTACTTTAGAAGATGTAAAAGAAGATGTTGTTGAAATTGTTTCTAATTTGAAAAAGGTTGCG

ATAAAGCTTGATACAGGTATAGATAGACTAGATTTAGAACTATCTGTAAATAAATCAGGTGTAGTTAGCGCTGGAGATTTTAAGACGACT

CAAGGTGTAGAAATAATAAATAAAGATCAGCCAATAGCTACTTTGACAAACCAAAGAGCATTTAGCTTAACTGCTACAGTGAGTGTAGGT

AGAAATGTCGGAATACTTTCTGCGATACCAACCGAGCTTGAGAGAGTTGGTGATATAGCTGTAGATGCTGATTTTAATCCTATTAAAAGA

GTTGCTTTTGAGGTTTTTGATAATGGTGATAGTGAAACTTTAGAAGTATTTGTAAAGACAAATGGTACTATAGAACCACTAGCAGCTGTT

ACGAAAGCTTTAGAGTATTTCTGTGAGCAAATATCAGTATTTGTATCTCTAAGAGTACCTAGTAATGGTAAAACAGGTGATGTATTAATA

GATTCTAATATTGATCCTATCCTTCTTAAGCCGATTGATGATTTAGAGCTAACTGTCAGATCATCTAACTGTCTGCGTGCAGAAAACATT

AAGTATCTTGGTGATTTGGTACAGTATTCTGAATCACAGCTTATGAAGATACCTAACTTAGGTAAGAAATCTCTCAATGAGATCAAACAA

ATTTTAATAGATAATAACTTGTCTCTAGGTGTCCAAATTGACAATTTTAGAGAGCTAGTTGAAGGAAAATAA

Sequence alignment, example 1

Page 17: Introduction to Bioinformatics for UVA Cell Bio 8401

Sequence alignment, example 1

● BLAST (Basic Local Alignment Search Tool)
● Go to blast.ncbi.nlm.nih.gov
● Click "Nucleotide BLAST" (blastn)
● Under "Choose Search Set", click the "Others" button, then search the entire nr/nt collection (you don't know what it is)

GTGAGTAATAATAATTCAAAACTGGAATTTGTACCTAATATACAGCTTAAAGAAGACTTAGGAGCTTTTAGCTATAAAGTCCAACTTTCT

CCTGTAGAAAAAGGTATGGCTCATATCCTTGGTAACTCTATTAGAAGGGTTTTATTATCTTCACTATCAGGTGCATCTATAATTAAAGTA

AACATCGCTAATGTACTACATGAGTATTCTACTTTAGAAGATGTAAAAGAAGATGTTGTTGAAATTGTTTCTAATTTGAAAAAGGTTGCG

ATAAAGCTTGATACAGGTATAGATAGACTAGATTTAGAACTATCTGTAAATAAATCAGGTGTAGTTAGCGCTGGAGATTTTAAGACGACT

CAAGGTGTAGAAATAATAAATAAAGATCAGCCAATAGCTACTTTGACAAACCAAAGAGCATTTAGCTTAACTGCTACAGTGAGTGTAGGT

AGAAATGTCGGAATACTTTCTGCGATACCAACCGAGCTTGAGAGAGTTGGTGATATAGCTGTAGATGCTGATTTTAATCCTATTAAAAGA

GTTGCTTTTGAGGTTTTTGATAATGGTGATAGTGAAACTTTAGAAGTATTTGTAAAGACAAATGGTACTATAGAACCACTAGCAGCTGTT

ACGAAAGCTTTAGAGTATTTCTGTGAGCAAATATCAGTATTTGTATCTCTAAGAGTACCTAGTAATGGTAAAACAGGTGATGTATTAATA

GATTCTAATATTGATCCTATCCTTCTTAAGCCGATTGATGATTTAGAGCTAACTGTCAGATCATCTAACTGTCTGCGTGCAGAAAACATT

AAGTATCTTGGTGATTTGGTACAGTATTCTGAATCACAGCTTATGAAGATACCTAACTTAGGTAAGAAATCTCTCAATGAGATCAAACAA

ATTTTAATAGATAATAACTTGTCTCTAGGTGTCCAAATTGACAATTTTAGAGAGCTAGTTGAAGGAAAATAA

Page 18: Introduction to Bioinformatics for UVA Cell Bio 8401
Page 19: Introduction to Bioinformatics for UVA Cell Bio 8401
Page 20: Introduction to Bioinformatics for UVA Cell Bio 8401

Sequence alignment, example 2

● Illumina HiSeq 2500:
    ○ 600,000,000,000 bases sequenced in a single run.
    ○ 6,000,000,000 x 100-bp (short) reads
● BLAST is far too slow for this volume.
● BWA: Burrows-Wheeler Aligner (fast)
● Bowtie: fast, memory-efficient (aligns 25,000,000 35-bp reads per hour per CPU).
● Many others... MAQ, Eland, RMAP, SOAP, SHRiMP, BFAST, Mosaik, Novoalign, BLAT, GMAP, GSNAP, MOM, QPalma, SeqMap, VelociMapper, Stampy, mrFAST, etc.
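
A back-of-the-envelope check of the figures on this slide, as a rough sketch. It assumes the Bowtie rate quoted for 35-bp reads carries over unchanged to 100-bp reads, which is optimistic (longer reads generally align more slowly), so the CPU-hour figure is best read as a loose lower bound.

```python
# Rough arithmetic for one HiSeq 2500 run, using the numbers quoted on the slide.
BASES_PER_RUN = 600_000_000_000          # bases sequenced in a single run
READ_LENGTH_BP = 100                     # short-read length
BOWTIE_READS_PER_CPU_HOUR = 25_000_000   # quoted for 35-bp reads (assumed to carry over)

# Number of reads implied by the run size and read length.
reads_per_run = BASES_PER_RUN // READ_LENGTH_BP

# Single-CPU alignment time if the quoted Bowtie rate held for these reads.
cpu_hours = reads_per_run / BOWTIE_READS_PER_CPU_HOUR

print(f"reads per run: {reads_per_run:,}")
print(f"CPU-hours to align with Bowtie: {cpu_hours:,.0f}")
```

Even under this generous assumption a single run costs hundreds of CPU-hours, which is why short-read aligners are run in parallel across many cores.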

Page 21: Introduction to Bioinformatics for UVA Cell Bio 8401

Subdisciplines

● Sequence alignment (DNA, RNA, Protein)
● Genome annotation
● Evolutionary biology / comparative genomics
● Analysis of gene expression
● Analysis of gene regulation
● Genotype-phenotype association
● Mutation analysis
● Structural biology
● Biomarker identification
● Pathway analysis / "systems biology"
● Literature analysis / text-mining

Page 22: Introduction to Bioinformatics for UVA Cell Bio 8401

Comparative Genomics example

● Go to genome.ucsc.edu
● Search for POLR2A
● Turn on some conservation tracks

Page 23: Introduction to Bioinformatics for UVA Cell Bio 8401

Sequence similarity

Evolutionary distance

Page 24: Introduction to Bioinformatics for UVA Cell Bio 8401

Subdisciplines

● Sequence alignment (DNA, RNA, Protein)
● Genome annotation
● Evolutionary biology / comparative genomics
● Analysis of gene expression
● Analysis of gene regulation
● Genotype-phenotype association
● Mutation analysis
● Structural biology
● Biomarker identification
● Pathway analysis / "systems biology"
● Literature analysis / text-mining

Page 25: Introduction to Bioinformatics for UVA Cell Bio 8401

Genetic Epidemiology

Epidemiology: the study of the patterns, causes, and effects of health and disease conditions in defined populations.

Genetic epidemiology: the study of genetic factors in determining health and disease in families and populations.

Page 26: Introduction to Bioinformatics for UVA Cell Bio 8401

DNA provides assembly instructions for proteins

Protein folding determines molecular function

Networks of interacting proteins determine tissue/organ function

Page 27: Introduction to Bioinformatics for UVA Cell Bio 8401

Genetic epidemiology

● Linkage: finding genetic loci that segregate with the disease in families.
● Association: finding alleles that co-occur with disease in populations.
    ○ Common disease - common variant hypothesis:
        ■ Common variants (e.g. >1-5% frequency in the population) contribute to common, complex disease.
    ○ Common disease - rare variant hypothesis:
        ■ Polymorphisms that cause disease are under purifying selection, and will thus be rare.
    ○ Really, it's a mix of both

Page 28: Introduction to Bioinformatics for UVA Cell Bio 8401

Candidate gene study

● Select candidate genes based on:
    ○ Known biology
    ○ Previous linkage/association evidence
    ○ Pathways
    ○ Evidence from model organisms
● Genotype variants (SNPs) in those genes
● Statistical association

Genotype at position rs12345: A/T
Genotype at position rs12345: A/A
Genotype at position rs12345: T/T

Page 29: Introduction to Bioinformatics for UVA Cell Bio 8401

Genome-wide association study

● Genotype >500,000 SNPs
● Statistical test at each one
● Manhattan plot of results
● GWAS does not inform:
    ○ Which gene affected
    ○ How gene function perturbed
    ○ How biological function altered
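
To make "statistical test at each one" concrete, here is a minimal sketch of one common choice: an allelic chi-square test on a 2x2 table of allele counts (cases vs. controls), compared against a Bonferroni-corrected threshold. The slide does not say which test is used, so this is an illustrative assumption; the allele counts and the function name `allelic_chi2` are hypothetical.

```python
import math

def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p). Assumes every row and column total is non-zero.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    chi2 = 0.0
    for observed, row, col in [
        (case_a, row_case, col_a), (case_b, row_case, col_b),
        (ctrl_a, row_ctrl, col_a), (ctrl_b, row_ctrl, col_b),
    ]:
        expected = row * col / n
        chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Testing 500,000 SNPs means 500,000 chances for a false positive, so the
# per-SNP significance threshold is divided by the number of tests.
N_SNPS = 500_000
BONFERRONI_ALPHA = 0.05 / N_SNPS

# Hypothetical allele counts at a single SNP (1,000 cases, 1,000 controls).
chi2, p = allelic_chi2(case_a=1200, case_b=800, ctrl_a=1000, ctrl_b=1000)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, genome-wide significant: {p < BONFERRONI_ALPHA}")
```

A Manhattan plot is then just -log10(p) for every SNP plotted against genomic position.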

Page 30: Introduction to Bioinformatics for UVA Cell Bio 8401

Subdisciplines

● Sequence alignment (DNA, RNA, Protein)
● Genome annotation
● Evolutionary biology / comparative genomics
● Analysis of gene expression
● Analysis of gene regulation
● Genotype-phenotype association
● Mutation analysis
● Structural biology
● Biomarker identification
● Pathway analysis / "systems biology"
● Literature analysis / text-mining

Page 31: Introduction to Bioinformatics for UVA Cell Bio 8401

Gene expression pre-2008

PCR
Microarrays

Page 32: Introduction to Bioinformatics for UVA Cell Bio 8401

Exercise (Thursday)

● Download R: r-project.org
● Download RStudio: rstudio.com
● Get data: http://people.virginia.edu/~sdt5z/GSE4107_RAW.zip
● Run code to download BioC packages:
    ○ source("http://bioconductor.org/biocLite.R")
    ○ biocLite()
    ○ biocLite(c("affy", "AnnotationDbi", "hgu133plus2cdf", "hgu133plus2.db", "genefilter", "DBI", "annotate", "arrayQualityMetrics", "limma", "GOstats", "Category", "GO.db", "KEGG.db"))

Page 33: Introduction to Bioinformatics for UVA Cell Bio 8401

Gene expression pre-2008

PCR
Microarrays

Page 34: Introduction to Bioinformatics for UVA Cell Bio 8401

RNA sequencing (RNA-seq)

[Figure: RNA-seq workflow]

Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor)
→ Isolate RNAs
→ Generate cDNA, fragment, size select, add linkers
→ Sequence ends (100s of millions of paired reads, 10s of billions of bases of sequence)
→ Align to genome
→ Downstream analysis

Image: www.bioinformatics.ca

Page 35: Introduction to Bioinformatics for UVA Cell Bio 8401

RNA-seq advantages

● No reference necessary
● Low background (no cross-hybridization)
● Unlimited dynamic range (FC 9000 Science 320:1344)
● Direct counting (microarrays: indirect – hybridization)
● Can characterize full transcriptome
    ○ mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc)
    ○ Differential gene expression
    ○ Differential coding output
    ○ Differential TSS usage
    ○ Differential isoform expression

Page 36: Introduction to Bioinformatics for UVA Cell Bio 8401

Isoform level data

Page 37: Introduction to Bioinformatics for UVA Cell Bio 8401

Isoform level data

Page 38: Introduction to Bioinformatics for UVA Cell Bio 8401

Differential splicing & TSS use

Page 39: Introduction to Bioinformatics for UVA Cell Bio 8401

RNA-seq challenges

● Library construction
    ○ Size selection (messenger, small)
    ○ Strand specificity?
● Bioinformatic challenges
    ○ Spliced alignment
    ○ Transcript deconvolution
● Statistical challenges
    ○ Highly variable abundance
    ○ Sample size: never, ever, plan n=1
● Normalization (RPKM)
    ○ Compare features of different lengths
    ○ Compare conditions with different sequence depth
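
RPKM (reads per kilobase of feature per million mapped reads) addresses both normalization bullets at once. The sketch below uses the standard definition; the read counts, gene lengths, and library sizes are made-up numbers chosen for illustration, and the function name `rpkm` is hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    Dividing by length (in kb) makes long and short features comparable;
    dividing by library size (in millions) makes deep and shallow runs comparable.
    """
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)

# Hypothetical example: two genes in the same 20M-read library.
# Gene A is twice as long and collects twice the reads, so its RPKM is the same.
gene_a = rpkm(read_count=400, feature_length_bp=2000, total_mapped_reads=20_000_000)
gene_b = rpkm(read_count=200, feature_length_bp=1000, total_mapped_reads=20_000_000)

# Same gene A in a library sequenced twice as deeply: twice the reads, same RPKM.
gene_a_deep = rpkm(read_count=800, feature_length_bp=2000, total_mapped_reads=40_000_000)

print(gene_a, gene_b, gene_a_deep)
```

RPKM is only one normalization scheme; count-based differential expression tools typically apply their own library-size normalization instead.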

Page 40: Introduction to Bioinformatics for UVA Cell Bio 8401

Common question #1: Depth

● Question: how much sequence do I need?
● Answer: it’s complicated.
● Depends on:
    ○ Size & complexity of transcriptome
    ○ Application: differential gene expression, transcript discovery, aberrant splicing, etc.
    ○ Tissue type, RNA quality, library preparation
    ○ Sequencing type: length, single-/paired-end, etc.
● Find publication in your field w/ similar goals.
● Good news: 1 GA or ½ HiSeq lane is sufficient for most applications

Page 41: Introduction to Bioinformatics for UVA Cell Bio 8401

Common question #2: Sample Size

● Question: How many samples should I sequence?

● Oversimplified Answer: At least 3 biological replicates per condition.

● Depends on:
    ○ Sequencing depth
    ○ Application
    ○ Goals (prioritization, biomarker discovery, etc.)
    ○ Effect size, desired power, statistical significance

● Find a publication with similar goals

Page 42: Introduction to Bioinformatics for UVA Cell Bio 8401

Common question #3: Workflow

● How do I analyze the data?
● No standards!
    ○ Unspliced aligners: BWA, Bowtie, Stampy, SHRiMP
    ○ Spliced aligners: TopHat, MapSplice, SpliceMap, GSNAP, QPALMA
    ○ Reference builds & annotations: UCSC, Entrez, Ensembl
    ○ Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, TransABySS
    ○ Quantification: Cufflinks, RSEM, MISO, ERANGE, NEUMA, Alexa-Seq
    ○ Differential expression: Cuffdiff, DEGseq, DESeq, edgeR, Myrna
● Like early microarray days: lots of excitement, lots of tools, little knowledge of integrating tools in pipeline!
● Benchmarks
● Microarray: Spike-ins (Irizarry)
● RNA-Seq: ???, simulation, ???

Page 43: Introduction to Bioinformatics for UVA Cell Bio 8401

Phases of NGS analysis

● Primary
    ○ Conversion of raw machine signal into sequence and qualities
● Secondary
    ○ Alignment of reads to reference genome or transcriptome
    ○ De novo assembly of reads into contigs
● Tertiary
    ○ SNP discovery/genotyping
    ○ Peak discovery/quantification (ChIP, MeDIP)
    ○ Transcript assembly/quantification (RNA-seq)
● Quaternary
    ○ Differential expression
    ○ Enrichment, pathways, correlation, clustering, visualization, etc.

Page 44: Introduction to Bioinformatics for UVA Cell Bio 8401

Extra credit (not really): RNA-seq

http://bit.ly/galaxy-rnaseq

● #1: Learn to use Galaxy: bit.ly/uva-galaxy
● #2: Run through an RNA-seq exercise in 1 hour:
    ○ Read some background material on RNA-seq
    ○ Read the TopHat/Cufflinks method paper
    ○ Get some data (Illumina BodyMap)
    ○ QC / trim your reads
    ○ Map to hg19 with TopHat
    ○ Visualize where reads map
    ○ Assemble with Cufflinks
    ○ Differential expression with Cuffdiff

Page 45: Introduction to Bioinformatics for UVA Cell Bio 8401

Subdisciplines

● Sequence alignment (DNA, RNA, Protein)
● Genome annotation
● Evolutionary biology / comparative genomics
● Analysis of gene expression
● Analysis of gene regulation
● Genotype-phenotype association
● Mutation analysis
● Structural biology
● Biomarker identification
● Pathway analysis / "systems biology"
● Literature analysis / text-mining

Page 46: Introduction to Bioinformatics for UVA Cell Bio 8401

How are genes regulated?

● Transcription factors (ChIP-seq)
● Micro-RNAs (RNA-seq)
● Chromatin accessibility (DNase-seq)
● DNA methylation (RRBS-seq, MeDIP-seq)
● RNA processing
● RNA transport
● Translation
● Post-translational modification

Page 47: Introduction to Bioinformatics for UVA Cell Bio 8401

Importance of DNA methylation

● Occurs most frequently at CpG sites
● High methylation at promoters ≈ silencing
● Methylation perturbed in cancer
● Methylation associated with many other complex diseases: neural, autoimmune, response to env.
● Mapping DNA methylation → new disease genes & drug targets.
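
Because methylation concentrates at CpG sites, a common first step is simply locating CpG-rich sequence. The sketch below counts CpG dinucleotides and computes the observed:expected CpG ratio, (number of CpG x sequence length) / (number of C x number of G), the usual statistic behind CpG-island finders. The example sequence and the function name `cpg_stats` are hypothetical, and real tools apply this over sliding windows with additional length and GC thresholds.

```python
def cpg_stats(seq):
    """Return CpG count, GC content, and observed:expected CpG ratio for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were placed independently is (C * G) / N.
    if c_count and g_count:
        obs_exp = cpg_count * n / (c_count * g_count)
    else:
        obs_exp = None  # ratio undefined when the sequence has no C or no G
    return {"cpg": cpg_count, "gc_content": gc_content, "obs_exp": obs_exp}

# Hypothetical short sequence, for illustration only.
stats = cpg_stats("ACGTCGCG")
print(stats)
```

Most of a mammalian genome has a ratio well below 1 because methylated cytosines at CpG sites tend to mutate away over evolutionary time; regions where the ratio stays high are candidate CpG islands.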

Page 48: Introduction to Bioinformatics for UVA Cell Bio 8401

DNA Methylation Challenges

● Dynamic and tissue-specific
● DNA → Collection of cells which vary in 5meC patterns → 5meC pattern is complex.
● Further, uneven distribution of CpG targets
● Multiple classes of methods:
    ○ Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs.
    ○ Affinity enrichment, count-based: Assay methylation level across many genomic loci.
● Many methods
● Many algorithms

Page 49: Introduction to Bioinformatics for UVA Cell Bio 8401

Many methylation methods

DNA methylation:
BS-Seq: Whole-genome bisulfite sequencing
RRBS-Seq: Reduced representation bisulfite sequencing
BC-Seq: Bisulfite capture sequencing
BSPP: Bisulfite specific padlock probes
Methyl-Seq: Restriction enzyme based methyl-seq
MSCC: Methyl sensitive cut counting
HELP-Seq: HpaII fragment enrichment by ligation PCR
MCA-Seq: Methylated CpG island amplification
MeDIP-Seq: Methylated DNA immunoprecipitation
MBP-Seq: Methyl-binding protein sequencing
MethylCap-seq: Methylated DNA capture by affinity purification
MIRA-Seq: Methylated CpG island recovery assay

Gene expression:
RNA-Seq: High-throughput cDNA sequencing

Page 50: Introduction to Bioinformatics for UVA Cell Bio 8401

Methylation methods: Features & biases

Page 51: Introduction to Bioinformatics for UVA Cell Bio 8401

Methylation: Bioinformatics Resources

Resource | Purpose | URL
Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
CpG Analyzer | Windows-based program for bisulphite DNA | -
CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
CpG Island Explorer | Online program for CpG Island identification | http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher | Online program for CpG Island identification | http://cpgislands.usc.edu
CpG PatternFinder | Windows-based program for bisulphite DNA | -
CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/ CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
MethDB | Database for DNA methylation data | http://www.methdb.de
MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal

Page 52: Introduction to Bioinformatics for UVA Cell Bio 8401

Subdisciplines

● Sequence alignment (DNA, RNA, Protein)
● Genome annotation
● Evolutionary biology / comparative genomics
● Analysis of gene expression
● Analysis of gene regulation
● Genotype-phenotype association
● Mutation analysis
● Structural biology
● Biomarker identification
● Pathway analysis / "systems biology"
● Literature analysis / text-mining

Page 53: Introduction to Bioinformatics for UVA Cell Bio 8401

Jeong, H. et al. (2001). Nature 411:41-42.
Ptacek, J. et al. (2005). Nature 438:679-684.
Guimera and Amaral. (2005). Nature 433:895-900.
Tong, A.H. et al. (2001). Science 294:2364-2368.
Zhu, X. et al. (2007). Genes & Dev 21:1010-1024.

One gene, one enzyme, one function?

Page 54: Introduction to Bioinformatics for UVA Cell Bio 8401

Distribution of disease genes

Diseases connected if same gene implicated in both.

Genes connected if implicated in the same disorder.

Goh et al. (2007). PNAS 104:8685.
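
The two networks described on this slide are both projections of one gene-disease association table. A minimal sketch of that construction follows; the gene and disease labels are placeholders rather than real associations, and the function name `project` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def project(associations):
    """Build both projections from (gene, disease) pairs.

    Returns (disease_edges, gene_edges), each a set of alphabetically sorted pairs:
    - two diseases are linked if the same gene is implicated in both;
    - two genes are linked if they are implicated in the same disorder.
    """
    genes_by_disease = defaultdict(set)
    diseases_by_gene = defaultdict(set)
    for gene, disease in associations:
        genes_by_disease[disease].add(gene)
        diseases_by_gene[gene].add(disease)

    disease_edges = {
        pair
        for diseases in diseases_by_gene.values()
        for pair in combinations(sorted(diseases), 2)
    }
    gene_edges = {
        pair
        for genes in genes_by_disease.values()
        for pair in combinations(sorted(genes), 2)
    }
    return disease_edges, gene_edges

# Placeholder associations, not real data.
toy = [("G1", "D1"), ("G1", "D2"), ("G2", "D2"), ("G3", "D3")]
disease_edges, gene_edges = project(toy)
print(disease_edges)  # D1 and D2 share gene G1
print(gene_edges)     # G1 and G2 are both implicated in D2
```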

Page 55: Introduction to Bioinformatics for UVA Cell Bio 8401

Distribution of disease genes

Genes connected if implicated in the same disorder.

Goh et al. (2007). PNAS 104:8685.

Overlay with PPI data

Genes contributing to a common disease interact through protein-protein interactions.

Page 56: Introduction to Bioinformatics for UVA Cell Bio 8401

Distribution of disease genes

Seebacher and Gavin (2011). Cell 144:1000-1001

k = degree = # interaction partners

● “Essential” genes
● Encode hubs
● Are expressed globally

● “Non-essential” disease genes
● Do not encode hubs
● Tissue specific expression

Page 57: Introduction to Bioinformatics for UVA Cell Bio 8401

Distribution of disease genes

● Disease genes at functional periphery of cellular networks (Goh PNAS 2007).
● Genes contributing to a common disease interact through protein-protein interactions (Goh PNAS 2007).
● Diseaseome analysis: a patient is 2x as likely to develop another disease if that disease shares a gene with the patient's primary disease (Park et al. 2009. The Impact of Cellular Networks on Disease Comorbidity. Mol Syst Biol 5:262).
● miRNA analysis: if diseases are connected when their associated genes are regulated by a common miRNA, the diseases segregate by class. E.g. cancers share similar associations at miRNA level (Lu et al. 2008. An analysis of human microRNA and disease associations. PLoS ONE 3:e3420).

Nonrandom placement of disease genes in interactome!

Page 58: Introduction to Bioinformatics for UVA Cell Bio 8401

Distribution of disease genes

Vidal et al, Cell 2011.

Page 59: Introduction to Bioinformatics for UVA Cell Bio 8401

Distribution of disease genes

● Data is cheap and diverse.
    ○ Genetic variation: GWAS, next-gen sequencing
    ○ Gene expression: Microarray, RNA-seq
    ○ Proteomics: Y2H, CoAP/MS

● Cellular components interact in a network with other cellular components.

● Disease is the result of an abnormality in that network.

● Integrate multiple data types, understand network, understand disease.

Page 60: Introduction to Bioinformatics for UVA Cell Bio 8401

Pathway Analysis

● You’ve done your microarray/RNA-Seq experiment
    ○ You have a list of genes
    ○ Want to put these into functional context
    ○ What biological processes are perturbed?
    ○ What pathways are being dysregulated?
    ○ Data reduction: hundreds or thousands of genes can be reduced to 10s of pathways
    ○ Identifying active pathways = more explanatory power
● “Pathway analysis” encompasses many, many techniques:
    ○ 1st Generation: Overrepresentation Analysis (e.g. GO ORA)
    ○ 2nd Generation: Functional Class Scoring (e.g. GSEA)
    ○ 3rd Generation (in development): Pathway Topology (e.g. SPIA)
● http://gettinggeneticsdone.com/2012/03/pathway-analysis-for-high-throughput.html

Page 61: Introduction to Bioinformatics for UVA Cell Bio 8401

Pathway Analysis: Over-representation analysis

● Many variations on the same theme: statistically evaluates the fraction of genes in particular pathway that show changes in expression.

● Algorithm:
    ○ Create input list (e.g. “significant at p<0.05”)
    ○ For each gene set:
        ■ Count number of input genes
        ■ Count number of “background” genes (e.g. all genes on platform).
    ○ Test each pathway for over-representation of input genes
● Gene Set: typically gene ontology (GO) term.

Page 62: Introduction to Bioinformatics for UVA Cell Bio 8401

Pathway analysis: over-representation analysis

● Ontology = formal representation of a knowledge domain.

● Gene ontology = cell biology.
● GO represented by directed acyclic graph (DAG).
    ○ Terms are nodes, relationships are edges.
    ○ Parent terms are more general than their child terms.
    ○ Unlike a simple tree, terms can have multiple parents.

Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), 509-15.
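
A small sketch of why the DAG structure matters in practice: a gene annotated to a specific term is implicitly annotated to every more general term above it, so tools walk the graph upward to collect ancestors. The term names below are placeholders, not real GO identifiers, and the `parents` mapping and `ancestors` function are hypothetical.

```python
def ancestors(term, parents):
    """Return every term reachable by following parent edges upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:   # a term with multiple children is visited once
                seen.add(parent)
                stack.append(parent)
    return seen

# Placeholder DAG: "child" has two parents, which is what makes this a DAG
# rather than a simple tree.
parents = {
    "child": ["parent_a", "parent_b"],
    "parent_a": ["root"],
    "parent_b": ["root"],
    "root": [],
}

print(sorted(ancestors("child", parents)))
```

Because annotations propagate upward like this, general terms near the root collect many genes and parent/child terms give correlated enrichment results.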

Page 63: Introduction to Bioinformatics for UVA Cell Bio 8401

Pathway analysis:Over-representation analysis

● Algorithm:
    ○ Create input list (e.g. “significant at p<0.05”)
    ○ For each gene set:
        ■ Count number of input genes
        ■ Count number of “background” genes (e.g. all genes on platform).
    ○ Test each pathway for over-representation of input genes
● Ex: GO “Purine Ribonucleotide Biosynthetic Process”
    ○ 1% of input (significant) genes are annotated with this term.
    ○ 1% of genes on the chip are annotated with this term.
    ○ Not significantly overrepresented.
● Ex: GO “V(D)J Recombination”
    ○ 20% of input (significant) genes are annotated with this term.
    ○ 1% of genes on the chip are annotated with this term.
    ○ Highly significantly over-represented!
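
The "test each pathway" step is often done with a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below applies it to the two examples above. The slide gives only percentages, so the absolute numbers (20,000 genes on the chip, 100 significant genes) are assumptions chosen to match them, and `hypergeom_upper_tail` is a hypothetical helper name.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: background genes (e.g. all genes on the chip)
    K: background genes annotated with the term
    n: input (significant) genes
    k: input genes annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Assumed sizes: 20,000 genes on the chip, 1% (200) annotated with each term,
# and 100 significant genes in the input list.
N_BACKGROUND, K_TERM, N_INPUT = 20_000, 200, 100

# Purine ribonucleotide biosynthesis: 1% of input genes carry the term (1 of 100).
p_purine = hypergeom_upper_tail(1, N_INPUT, K_TERM, N_BACKGROUND)

# V(D)J recombination: 20% of input genes carry the term (20 of 100).
p_vdj = hypergeom_upper_tail(20, N_INPUT, K_TERM, N_BACKGROUND)

print(f"purine biosynthesis p = {p_purine:.3f}")
print(f"V(D)J recombination p = {p_vdj:.2e}")
```

In a real analysis this test is repeated for every gene set, so the resulting p-values also need multiple-testing correction.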

Page 64: Introduction to Bioinformatics for UVA Cell Bio 8401

Pathway analysis

● Pathway analysis gives you more biological insight than staring at lists of genes.

● Pathway analysis is complex, and has many limitations.

● Pathway analysis is still more of an exploratory procedure rather than a pure statistical endpoint.

● The best conclusions are made by viewing enrichment analysis results through the lens of the investigator’s expert biological knowledge.

Page 65: Introduction to Bioinformatics for UVA Cell Bio 8401

Subdisciplines

● Sequence alignment (DNA, RNA, Protein)
● Genome annotation
● Evolutionary biology / comparative genomics
● Analysis of gene expression
● Analysis of gene regulation
● Genotype-phenotype association
● Mutation analysis
● Structural biology
● Biomarker identification
● Pathway analysis / "systems biology"
● Literature analysis / text-mining

Page 66: Introduction to Bioinformatics for UVA Cell Bio 8401

Resources: Online community & discussion forum

● SEQanswers:
    ○ http://SEQanswers.com
    ○ Twitter: @SEQquestions
    ○ Format: Forum
    ○ Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
● BioStar:
    ○ http://biostar.stackexchange.com
    ○ Twitter: @BioStarQuestion
    ○ Format: Q&A
    ○ Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216.

Page 67: Introduction to Bioinformatics for UVA Cell Bio 8401

Resources: further education

Regularly updated, comprehensive list of over 20 in-person and free online workshops in bioinformatics, programming, statistics, genetics, etc.

stephenturner.us/p/edu

Page 68: Introduction to Bioinformatics for UVA Cell Bio 8401

Publicly Available Data: NCBI

● GenBank: http://www.ncbi.nlm.nih.gov/genbank/
    ○ Collection of all publicly available DNA sequences.
    ○ Feb 2013: 150,141,354,858 bases from 162,886,727 sequences.
● NCBI Genomes: http://www.ncbi.nlm.nih.gov/genome/
    ○ Public repository for sequenced genomes.
    ○ March 2013: 3,005 eukaryotes, 19,125 prokaryotes, 3,570 viruses.
● NCBI Taxonomy: http://www.ncbi.nlm.nih.gov/taxonomy
    ○ Publicly available classification and nomenclature database for all organisms in the public sequences database.
    ○ Phylogenetic lineages for >160,000 organisms (est. ~10% life on the planet)
● GEO: http://www.ncbi.nlm.nih.gov/geo/
    ○ Public repository of sequence- and array-based gene expression data, free for the taking.
    ○ 900,000+ samples, 3,200+ datasets.
● dbGaP: http://www.ncbi.nlm.nih.gov/gap
    ○ Public repository for genetic studies.
    ○ 2,500+ datasets, 100,000+ variables.
● SRA: http://www.ncbi.nlm.nih.gov/sra
    ○ Public repository for raw sequencing data from NGS platforms.
    ○ 3,500,000,000,000,000 bases sequenced.

Page 69: Introduction to Bioinformatics for UVA Cell Bio 8401

Publicly Available Data: Databases

● 2013 Nucleic Acids Research Database Issue
    ○ http://nar.oxfordjournals.org/content/41/D1/D1.abstract
    ○ 176 articles describing new/updated molecular biology databases.
● NAR Molecular Biology Database Collection
    ○ http://www.oxfordjournals.org/nar/database/a/
    ○ 1,512 molecular biology databases
    ○ Categories: DNA/RNA/Protein sequences, structures, metabolic/signaling pathways, genes & genomes, human diseases, microarray/other gene expression data, proteomics, organelles, plants, immunological, cell bio, …

Page 70: Introduction to Bioinformatics for UVA Cell Bio 8401

Publicly Available Data: Webservers

● 2012 NAR Web Server Issue
    ○ http://nar.oxfordjournals.org/content/40/W1.toc
    ○ 102 articles/webservers featured
● Bioinformatics Links Directory
    ○ http://bioinformatics.ca/links_directory/
    ○ Includes all the NAR resources above.
    ○ 1,376 tools, 620 databases, 163 other resources
    ○ Topics: computer-related, DNA, education, expression, genomics, literature, model organisms, RNA, protein, other molecules, sequence comparison, …

Page 71: Introduction to Bioinformatics for UVA Cell Bio 8401

Bioinformatics Core Mission: help scientists publish their work and obtain new funding through service and training.

Page 72: Introduction to Bioinformatics for UVA Cell Bio 8401

Services

● Gene expression: Microarray Analysis
● Gene expression: RNA-seq Analysis
● Pathway analysis
● DNA Variation (GWAS, NGS)
● DNA Binding / ChIP-Seq
● DNA Methylation
● Metagenomics
● Grant / Manuscript support
● Custom development (computing & stats)
● ... etc.

Page 73: Introduction to Bioinformatics for UVA Cell Bio 8401

Contact

Web: bioinformatics.virginia.edu

E-mail: [email protected]

Blog: GettingGeneticsDone.com

Twitter: @genetics_blog