Comparative genomics and target discovery
Maarten Sollewijn Gelpke, MDI, Organon
Post on 19-Dec-2015
What is comparative genomics?
What can we learn from comparative genomics?
What is target discovery?
What are the implications of comparative genomics for target discovery?
What issues in target discovery can be addressed by comparative genomics?
Overview
Introduction to genomes and sequencing. Comparative genomics aspects. Phylogenomics concepts. Examples of comparative genomics.
Sequence availability
Availability of gene and protein sequences has increased enormously during the last two decades.
Current capacity of the main sequencing centres is >3 Gb per month per centre.
This will increase again dramatically with the development of new superfast sequencing techniques.
Currently > 100Gbases
Genomes sequenced
Timeline of sequenced genomes:
1995: first bacterial genomes sequenced (H. influenzae and M. genitalium)
1996: the yeast genome
1997: E. coli K12
1998: C. elegans
1999: full sequence of chr. 22
2000: D. melanogaster genome and chr. 21; A. thaliana
2001: human draft
2002: mouse, Ciona, rice, Fugu, Anopheles
2003-2005: chimpanzee; human finished; rat; chicken
Also shown: Xenopus, zebrafish
Genome sequencing
Evolutionary relationship between metazoans (multicellular animals) that have been sequenced or are due for sequencing.
Figure: tree of sequenced metazoans, labelled with the clades metazoans, nematodes, insects, chordates, tunicates, vertebrates, fish, tetrapods, amphibia, amniota, aves, mammalia, primates, rodents, bovidae and carnivores.
Genome sequencing
BAC fingerprinting shotgun approach: accurate but laborious!
Shotgun sequencing (WGS)
Figure: a BAC clone (100-200 kb) or a whole genome (30 Mb to 3 Gb) is sheared into 1.0-2.0 kb fragments, which serve as sequencing templates and yield random reads. Assembly produces a consensus sequence that still contains gaps, low base quality regions, single-stranded regions and (inverted) mis-assemblies; these are resolved during finishing.
Genome sequencing
Current state of sequenced ‘organisms’:
  >316 prokaryotes
  >27 archaea
  >280 eukaryotes (complete, in assembly or in progress)
  >1600 viruses and >500 mitochondria/chloroplasts
Some ongoing genome sequencing projects:
  Poplar, gibbon, platypus, Drosophila species, a variety of pathogenic fungi and bacteria, etc.
  Meta-genomic projects on environmental samples (soil, deep-sea, waste sites)
Future of genome sequencing?
New complete genomes.
New low-redundancy genomes.
New (low-redundancy) genome areas.
Meta-genomics.
Sequencing of microbial communities.
Sequencing of extinct species:
  40,000-year-old cave bear: 26k, 21 genes.
  45,000-year-old Neanderthal: 75k; diverged from the human lineage ~315,000 years ago.
Comparative genomics
Discover what lies hidden in genomic sequence by comparing sequence information.
Main areas:
  Whole genome alignment
  Gene prediction
  Regulatory element prediction
  Phylogenomics
  Pharmacogenetics
Affected by evolutionary aspects:
  Mutational forces (introduce random mutations)
  Selection pressures
  Ratio of non-synonymous to synonymous substitutions
  Mutation rates lower or higher than neutral
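The ratio of non-synonymous to synonymous substitutions mentioned above can be illustrated with a minimal sketch. It assumes two in-frame, gap-free coding sequences of equal length and only counts codons that differ with or without an amino acid change. This raw count is a crude proxy: a proper dN/dS estimate also normalizes by the number of synonymous and non-synonymous sites, which is not done here. The function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of differing codons.

    Assumption: both sequences are in frame, gap-free and of equal length.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length, a multiple of 3")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid: silent change
        else:
            non_synonymous += 1  # amino acid replaced
    return synonymous, non_synonymous


# TTT->TTC keeps Phe (synonymous); AAA->AGA changes Lys to Arg (non-synonymous).
syn, nonsyn = count_codon_changes("TTTGCTAAA", "TTCGCTAGA")
```

An excess of non-synonymous changes relative to the neutral expectation hints at positive selection; a deficit hints at purifying selection.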
Comparing sequences, methods.
Pairwise comparison of sequences (alignments):
  proteins or genes
  variety of local alignment tools like BLAST, Smith-Waterman etc.
Multiple sequence comparisons (ClustalW, Muscle etc.)
Results may depend on alignment settings.
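As an illustration of local alignment and of how the result depends on the settings, here is a minimal Smith-Waterman scoring sketch. The scoring values (match, mismatch, linear gap penalty) are arbitrary assumptions chosen for the example, not defaults of any particular tool, and only the best score is returned (no traceback).

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Same sequences, different gap penalties: a cheap gap lets the alignment
# bridge the extra T, an expensive gap forces a shorter ungapped alignment.
cheap_gap = smith_waterman_score("ACGTACGT", "ACGTTACGT", gap=-2)
costly_gap = smith_waterman_score("ACGTACGT", "ACGTTACGT", gap=-10)
```

Keeping only two rows of the matrix is enough for the score; reporting the alignment itself would need the full matrix and a traceback.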
Comparing sequences, methods.
Whole genome comparisons:
  Large stretches of sequence
  Divergence up to 450 Mya (Fugu-human) with sufficient similarity remaining
  BLAT, BLASTZ, Phusion/BlastN
  Seeding strategy → alignment extension → gapped alignments
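The seeding and extension steps can be sketched as follows. This is a toy illustration under stated assumptions, not the algorithm of BLAT or BLASTZ: seeds are exact k-mer matches, extension is ungapped with a simple X-drop cutoff, and the final gapped alignment step is left out. All function names and parameters are hypothetical.

```python
from collections import defaultdict


def index_kmers(sequence, k):
    """Map every k-mer in the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def _walk(query, target, qi, ti, step, x_drop):
    """Ungapped extension in one direction; returns (best score, length)."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        score += 1 if query[qi] == target[ti] else -1
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score dropped too far below the best seen: stop
        qi += step
        ti += step
    return best, best_len


def seed_and_extend(query, target, k=4, x_drop=2):
    """Return sorted (score, query_start, target_start, length) hits."""
    index = index_kmers(target, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            right_score, right_len = _walk(query, target, q + k, t + k, 1, x_drop)
            left_score, left_len = _walk(query, target, q - 1, t - 1, -1, x_drop)
            hits.add((k + left_score + right_score,
                      q - left_len, t - left_len,
                      k + left_len + right_len))
    return sorted(hits, reverse=True)


hits = seed_and_extend("TTTACGTACCC", "GGGACGTACAAA")
```

In a full pipeline each extended hit would then be passed to a gapped alignment step, for example a dynamic programming alignment restricted to the neighbourhood of the hit.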
Whole genome comparison
Conservation of synteny!
  Cross-reference of any genetic traits (diseases!) from one organism (e.g. mouse) to genes in the syntenic regions in the other organism (e.g. human).
Genome expansion and contraction
  Genome duplications, segmental duplications: important mechanism for generating new genes.
(G+C) content, CpG islands
  Reflect different mutational or DNA repair processes?
Repeats
  Transposable elements are a main force in reshaping genomes.
  TEs (or remainders thereof) can be used to measure evolutionary forces acting on the genome.
  Neutral mutation rate.
Gene prediction
Comparing sequences has contributed enormously to the accuracy of gene prediction.
Evidence-based method:
  Use cDNAs, ESTs and proteins from various organisms.
  Apply gene feature rules.
Figure labels: proteins, clustered ESTs, cDNAs, gene model.
Gene prediction
De novo methods:
  Alignment of genomic sequences
  Splicing rules and other gene features
De novo gene prediction by comparing sequences attempts to model negative selection against mutations. Regions with fewer mutations are conserved because mutations there were detrimental to the organism.
Prediction of similar proteins in both genomes.
Newly predicted protein in mouse and human, similar to the disease-related gene dystrophin.
Regulatory element prediction
The complexity of higher eukaryotes and their relatively low number of genes can be explained partially through the importance of transcriptional regulation.
Identification of regulatory elements (REs) will have an extensive impact on understanding gene expression patterns (expression intensity, tissue specificity) and the relations between them, and on inferring biological systems or networks.
Regulatory element prediction
No formal models for regulatory motifs.
Attempt to find conserved regions or motifs based on the global alignment of similar sequences from different organisms (phylogenetic footprinting).
  Which species to compare? Evolutionary distance?
  What regions around gene models to investigate? 5’ and 3’ flanking regions, introns?
  Take expression patterns into account?
  How does evolution affect REs?
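Phylogenetic footprinting can be sketched as a sliding-window scan over a multiple alignment. The example below is a simplified assumption of how such a scan might look: a column counts as conserved only if every species has the same base and no gap, and a window is reported when the fraction of conserved columns reaches a threshold. Window size, threshold and function names are hypothetical choices.

```python
def conserved_columns(alignment):
    """Flag each alignment column that is gap-free and identical in all rows."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    flags = []
    for i in range(length):
        column = {row[i] for row in alignment}
        flags.append(len(column) == 1 and "-" not in column)
    return flags


def conserved_windows(alignment, window=6, min_fraction=1.0):
    """Return (start, fraction) for windows meeting the conservation threshold."""
    flags = conserved_columns(alignment)
    hits = []
    for start in range(len(flags) - window + 1):
        fraction = sum(flags[start:start + window]) / window
        if fraction >= min_fraction:
            hits.append((start, fraction))
    return hits


# Three hypothetical aligned upstream sequences sharing the motif ACGTGA.
alignment = [
    "AAACGTGACTTT",
    "CCACGTGAGGGT",
    "GTACGTGATCAT",
]
footprints = conserved_windows(alignment, window=6)
```

The open questions on the slide map directly onto the parameters: more distant species make a fully conserved window less likely to arise by chance, while closely related species need longer windows or more sequences to reach the same confidence.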
Phylogenomics
Comparison of genes and gene products across a number of species (whole genomes), characterizing homologues and gaining insight into the evolutionary process itself.
Pharmacophylogenomics is the use of phylogenomics in aid of drug discovery, through improved target selection and validation.
Orthology and paralogy
Orthologs: genes in different species that arose from a single gene in the most recent common ancestor, by speciation.
Paralogs: genes in the same species that arose from a single gene in an ancestral species, by a process of gene duplication.
Figure: phylogenetic tree of gene X.
Target orthology
Species differences frequently affect progression of targets and compounds. Orthology maps in combination with expression studies may explain these differences.
Establishing orthology:
  Reciprocal highest-scoring BLAST hit.
  Conservation of synteny.
  Gene loss or rate of evolution issues.
Orthology does not guarantee common function (functional shift):
  Extensive sequence divergence
  High non-synonymous over synonymous nucleotide substitution ratios.
  Comparison of regulatory regions?
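The reciprocal highest-scoring hit criterion can be sketched as below. The input format is an assumption made for the example: each search direction is a dictionary mapping (query gene, subject gene) to a similarity score, such as a bit score parsed from tabular BLAST output. Gene names and scores are invented for illustration.

```python
def best_hits(scores):
    """Return {query: best-scoring subject} from {(query, subject): score}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: humanX and mouseX pick each other; humanY's best hit
# (mouseX) prefers humanX, so humanY gets no ortholog call.
human_vs_mouse = {
    ("humanX", "mouseX"): 950, ("humanX", "mouseY"): 400,
    ("humanY", "mouseX"): 500, ("humanY", "mouseZ"): 300,
}
mouse_vs_human = {
    ("mouseX", "humanX"): 940, ("mouseX", "humanY"): 510,
    ("mouseZ", "humanY"): 290,
}
orthologs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

The humanY case shows the gene loss and rate-of-evolution caveat from the slide: a missing or fast-evolving true ortholog makes a paralog the best hit in one direction, and the reciprocal test then correctly refuses the pairing.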
Target paralogy
Key insights in large pharmacologically relevant families (NRs, GPCRs) can be gained from paralogy analysis.
Paralogy is interrelated with several other gene-to-function mappings that can seriously affect the suitability of genes as drug targets.
Schematic representation of various mappings of genes to functions.
Figure levels: function, protein, gene. Mappings shown: paralogy, pleiotropy, redundancy, heteromery, crosstalk, alternative transcription.
Pleiotropy
  Suggested to precede paralogy
  Relaxed substrate or ligand specificity
  Multiple protein domains
  Tissue or cellular localization
Redundancy
  Total or partial redundancy of function
  Directly linked to paralogy
  Robustness against gene knock-outs (target validation)
  PPAR-δ / PPAR-α in skeletal muscle; PXR / FXR in bile acid signaling; dopamine transporters / serotonin transporters in adjacent neurons.
Heteromery
  Formation of heteromers between paralogs
  Known examples in major classes of drug targets:
    GPCRs: GABA-B receptors
    NRs: formation of heterodimers with retinoid X receptors (RXR)
    Ion channels
Crosstalk
  Combination of pleiotropy and redundancy
  May be regulated in time and space (expression and localization)
  Action of cytokines (interleukins) on immune cell types.
Alternative transcription
  Intermediate between paralogy and pleiotropy: ‘paralogy in place’
  Increases effective size of the genome (estimated >30% of human genes show alternative transcription!)
Effects on drug discovery
Functional shifts, pleiotropy and redundancy can each be good or bad news for drug discovery.
Functional shifts
  Misleading or unavailable animal model
  Animal toxicity irrelevant for humans
Pleiotropy
  Unintended drug effects
  Opportunities for multiple indications
Redundancy
  Disease resistant to treatment (multi-functionality)
  Highly selective treatment for complex diseases.
Pharmacogenetics
Within-species comparative genomics:
Single Nucleotide Polymorphisms (SNPs)
  Current focus on coding regions, expected to expand to sites of transcription regulation.
  Determine the site of a SNP and the allele frequencies from ethnic or multi-ethnic panels of individuals (e.g. 100).
Pharmacogenetics (PGx): relate SNP information to efficacy and safety issues during the drug development process.
  Efficacy PGx: select/predict drug responders, increase confidence in a certain drug in development.
  Safety PGx: identification of individuals with adverse effects to a drug.
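Determining allele frequencies from a panel of individuals amounts to counting alleles across genotypes. The sketch below assumes diploid genotypes written as two-letter strings for a single biallelic SNP; the panel of 100 individuals and its genotype counts are invented for illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as 'AG'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


# Hypothetical panel of 100 individuals typed at one SNP.
panel = ["AA"] * 60 + ["AG"] * 30 + ["GG"] * 10
frequencies = allele_frequencies(panel)  # A: 150/200, G: 50/200
```

Running the same count separately per ethnic panel is what allows a SNP that is rare overall but common in one population to be spotted before it affects efficacy or safety assessments.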
Examples
New genes and REs from yeast genomes.
Multi-species comparisons from targeted genomic regions.
Comparative genomics at the vertebrate extremes.
Pharmacogenetics in drug efficacy.
Comparison of yeast species to identify genes and regulatory elements. (Kellis et al, Nature 2003)
Saccharomyces cerevisiae and 3 related species
7x coverage WGS of each species
Assembly of draft genome sequence
S. cerevisiae genome aligned to others using ORFs as seeds
Most ORFs have 1:1 matches.
Considerable conserved synteny.
Most genomic rearrangements clustered in telomeric regions.
Local gene family expansion/contraction, creating phenotypic diversity over evolutionary time.
Balance between conservation and divergence allows for accurate gene identification and recognition of REs as well!
Identification of genes
Original S. cerevisiae genome (1996): 6275 ORFs
Re-analysis and other evidence (2002): 6062 ORFs
This study validates all ORFs using a reading frame conservation score (very sensitive):
  5538 ORFs accepted, 20 unresolved, 504 rejected!
In addition to gene recognition, gene structure definitions (start, stop, intron) were also largely improved.
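The idea behind a reading frame conservation score can be sketched as follows. The slides do not give the exact scoring used in the study, so this is a simplified assumption: given a pairwise alignment of a candidate ORF with its counterpart in a related species, it reports the fraction of aligned nucleotides that stay in the same codon position, taking the best of the three possible frame offsets. The function name and example sequences are hypothetical.

```python
def reading_frame_conservation(ref_aligned, other_aligned):
    """Fraction of aligned bases in the same codon position (best offset).

    Inputs are two rows of a pairwise alignment, with '-' marking gaps.
    """
    if len(ref_aligned) != len(other_aligned):
        raise ValueError("aligned sequences must have the same length")
    ref_pos = other_pos = aligned = 0
    in_frame = [0, 0, 0]  # one counter per candidate frame offset
    for ref_base, other_base in zip(ref_aligned, other_aligned):
        if ref_base != "-" and other_base != "-":
            aligned += 1
            for offset in range(3):
                if ref_pos % 3 == (other_pos + offset) % 3:
                    in_frame[offset] += 1
        if ref_base != "-":
            ref_pos += 1
        if other_base != "-":
            other_pos += 1
    return max(in_frame) / aligned if aligned else 0.0


# A 3-base deletion keeps the frame; a 1-base deletion shifts it.
frame_kept = reading_frame_conservation("ATGAAACCCGGG", "ATGAAA---GGG")
frame_shifted = reading_frame_conservation("ATGAAACCCGGG", "ATGAAA-CCGGG")
```

Real genes tolerate indels mostly in multiples of three, so they score high across species, whereas spurious ORFs accumulate frameshifts freely and score low.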
Identification of regulatory elements
REs are difficult to identify:
  Short (6-15 bp), sequence variation, few known rules
De novo discovery of REs directly from genomic sequence:
  Develop a motif conservation score system based on known motifs
  78 motifs discovered, overlapping with 36 of 55 known motifs
Putative annotation of motifs using adjacent genes (GO):
  25 of 42 new motifs show high category annotation correlation
Discovery of combinatorial control of REs
Applications to the human genome?
  Increase the number of species in the comparison to improve the low signal-to-noise ratio in humans.
Multi species comparisons from targeted genomic regions. (Thomas et al, Nature 2003)
Comparing targeted genomic regions in multiple evolutionarily diverse vertebrates (conservation is less likely to occur by chance)
ENCODE project:
  44 genomic regions (14 manually selected, some of them disease-related; 30 random) of diverse gene density and non-exonic conservation
  primates, bat, alligator, elephant, cat, emu, leopard, salmon etc.
Initial analysis:
  1.8 Mb on chromosome 7 containing 10 genes, including CFTR, from 12 species.
  Detection of ~1000 multi-species conserved sequences, of which >60% would not be detected by a two-species comparison.
Comparative genomics at the vertebrate extremes (Boffelli et al, Nature Reviews Genetics 2004)
What can be learned from comparisons of genomes that are distant or closely related in evolution?
Distant comparisons reveal the most constrained sequence elements.
  Most of the conserved human-fish non-coding sequences are found near genes with roles in embryonic development.
  Mutations can have an important role in human disease.
Human-Fugu conservation of non-coding sequence in the DACH gene area (development of brain, limbs, sensory organs).
Validation of identified enhancer regions by driving expression of a reporter in mouse embryos.
Comparative genomics at the vertebrate extremes
Intraspecies sequence comparisons allow identification of species-specific sequences.
Phylogenetic shadowing
Requires high rate of polymorphism
Comparison among primates shows human-specific sequences.
Analysis of regulatory sequence of ApoA (involved in human heart disease)
Figure: A. Mutation rate analysis of the Ciona intestinalis 5' region of the forkhead gene. B. Validation of identified potential regulatory elements in Ciona larvae.