Comparative genomics and proteomics in Ensembl
Sep 2006
2 of 56
Overview
• Rationale
• Species available
• Comparative proteomics
– Orthologue and paralogue prediction
– Protein clustering into families
• Comparative genomics
– Genome-wide DNA alignments
– Synteny block characterisation
• Future and perspectives
3 of 56
Compara
The Compara database is a single multispecies database holding:
• Gene orthology/paralogy predictions
• Protein clustering
• Whole genome alignments
• Synteny regions
4 of 56
The era of sequencing genomes
[Figure: phylogenetic tree of the species listed below, with branch divergence times on a "Million years" axis (ticks at 100, 200, 300, 400, 500 and 1000). Clade labels: Fungi, Nematoda, Arthropoda, Chordata, Urochordata, Vertebrata, Teleostei, Tetrapoda, Amphibia, Amniota, Aves, Mammalia, Metatheria, Eutheria.]
Red: whole genome assembly available
Green: whole genome assembly due within the next year in Ensembl
* 19 species currently in Ensembl
+ 10 in Pre! Ensembl
S. cerevisiae (baker’s yeast) *
C. elegans (nematode) *
A. mellifera (honey bee) *
D. melanogaster (fruitfly) *
A. gambiae (African malaria mosquito) *
A. aegypti (yellow fever mosquito) +
C. intestinalis (transparent sea squirt) *
C. savignyi (sea squirt) +
D. rerio (zebrafish) *
T. rubripes (torafugu) *
T. nigroviridis (spotted green pufferfish) *
O. latipes (Japanese medaka)
G. aculeatus (stickleback) +
X. tropicalis (western clawed frog) *
X. laevis (African clawed frog)
G. gallus (chicken) *
M. domestica (opossum) *
L. africana (elephant) +
O. aries (sheep)
B. taurus (cow) *
S. scrofa (pig)
E. caballus (horse)
F. catus (cat)
C. familiaris (dog) *
M. musculus (house mouse) *
R. norvegicus (Norway rat) *
M. mulatta (rhesus macaque) *
P. troglodytes (chimpanzee) *
H. sapiens (human) * +
5 of 56
Comparing different species
• From the Ensembl perspective, species are joined through
– orthologous/paralogous gene links
– chromosome synteny links
– protein family links
• From a broader perspective
– Where are syntenic regions located?
– How many genes are conserved?
– Where are orthologous/paralogous genes?
– Is gene order conserved?
– Where are potential regulatory regions?
– What is missing in one species, present only in another?
6 of 56
Orthologue and Paralogue Prediction
• Evolutionary studies
• Identify potential species-specific proteins/genes
• Identify orthologues of (human) genes in model organisms
7 of 56
Gene Evolution
• Divergence
• Speciation / Duplication
• Change within allelic population
• Point Mutations / Selection / Drift
• Exon/domain shuffling
• Transposition / Translocation
• Retroposition (reverse transcription)
• Horizontal gene transfer?
Orthologues and Paralogues
Reconstruct the Molecular Evolutionary history from the evidence visible within the known extant genes
8 of 56
Homologue Relationships
• Orthologues: any pair of genes whose most recent common ancestor node is a speciation event
• Paralogues: any pair of genes whose most recent common ancestor node is a duplication event
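The two definitions can be sketched in a few lines of Python. This is only an illustration: the toy tree, the gene names and the helper functions are hypothetical, and the event label on each internal node is assumed to be known already.

```python
# Hypothetical toy gene tree: an internal node is (event, left, right),
# a leaf is a gene name. Event labels are assumed to be given.
TREE = ("duplication",
        ("speciation", "human_A1", "mouse_A1"),
        ("speciation", "human_A2", "mouse_A2"))

def leaves(node):
    """Return the gene names found below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def classify_pairs(node, out=None):
    """Label every gene pair by the event at its most recent common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    kind = "orthologue" if event == "speciation" else "paralogue"
    # Genes taken from different children meet for the first time at this node.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = kind
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out

relations = classify_pairs(TREE)
```

Here human_A1 and mouse_A1 come out as orthologues, while human_A1 and mouse_A2 come out as paralogues because their ancestor node is the duplication at the root.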
9 of 56
Homologue Relationships
[Figure: gene tree drawn along a time axis. An ancestral gene A passes through duplication and speciation events to give the genes A1, A2, M1, M2, M2', H1 and H2; gene pairs are labelled as orthologues, inparalogues or outparalogues.]
Orthologous genes have originated from a single ancestor (and often have equivalent functions). Paralogous genes are related via duplication:
• Inparalogues (ortholog_one2one, ortholog_one2many, etc.): duplication follows speciation
• Between_species_paralog (outparalogues): duplication precedes speciation
10 of 56
Orthology Prediction Algorithm
• Find orthologous genes by comparing the protein sets of two species (only the longest peptide of each gene is considered).
• blastp+sw all versus all (on a paired-species basis)
• Build a graph of gene relations based on BRH (best reciprocal hit) and BSR (BLAST score ratio)
• Extract connected components (single linkage clusters), each cluster representing a gene family
[Figure: example graphs of hits between mouse and human genes]
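A minimal sketch of the BRH and single-linkage steps, assuming the all-versus-all scores are already available as a dictionary. The scores and gene names below are made up, and the BSR filtering step is left out for brevity.

```python
from collections import defaultdict

# Hypothetical blastp scores: (query, subject) -> score.
SCORES = {
    ("h1", "m1"): 200, ("h1", "m2"): 90,
    ("h2", "m2"): 150, ("h2", "m1"): 80,
    ("m1", "h1"): 195, ("m1", "h2"): 85,
    ("m2", "h2"): 155, ("m2", "h1"): 88,
    ("h3", "m2"): 60, ("m3", "h3"): 50,
}

def best_hits(scores):
    """Map each query to its highest-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > scores[(query, best[query])]:
            best[query] = subject
    return best

def reciprocal_best_hits(scores):
    """Keep only pairs where each gene is the other's best hit."""
    best = best_hits(scores)
    return {frozenset((q, s)) for q, s in best.items() if best.get(s) == q}

def connected_components(edges):
    """Single-linkage clusters: connected components of the hit graph."""
    graph = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

brh = reciprocal_best_hits(SCORES)
clusters = connected_components(brh)
```

In this toy data h3 and m3 have hits but no reciprocal best hit, so they do not enter any cluster.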
11 of 56
GeneTree prediction: MUSCLE/PHYML
• Multiple alignment of each cluster (built from BRH and BSR) with MUSCLE
• Unrooted gene tree built using PHYML (Guindon & Gascuel, 2003)
• Tree reconciliation (gene tree with species tree) using RAP (Dufayard et al. 2005) to call duplication events on internal nodes and to root the tree
• Infer pairwise relations of orthology and paralogy types (from each tree)
12 of 56
Molecular Phylogenetics
• Protein sequences in different species both:
– Provide information about the history of evolution
– Allow evolution to be reconstructed
• We are after an alignment that reflects all species equally
• Model the branching process by comparing gene and species trees (tree reconciliation)
13 of 56
Phylogenies
[Figure legend: duplication node; speciation node or leaf]
Revealing the evolutionary history that has led to the organisms at their current stage.
– Leaves are real genomes
– Internal nodes are ancestors
14 of 56
Orthologue and Paralogue types
• ortholog_one2one
• ortholog_one2many
• ortholog_many2many
• apparent_ortholog_one2one
• within_species_paralog
• between_species_paralog
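One way to read the three main orthologue types is by counting partners. The sketch below is an illustration under that assumption, not necessarily the exact rule Ensembl applies; the function name and gene names are hypothetical, and the apparent_ortholog and paralog types are not covered.

```python
from collections import Counter

def orthology_types(pairs):
    """Label each (gene_a, gene_b) orthologue pair by how many partners each gene has."""
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    labels = {}
    for a, b in pairs:
        if count_a[a] == 1 and count_b[b] == 1:
            labels[(a, b)] = "ortholog_one2one"
        elif count_a[a] == 1 or count_b[b] == 1:
            labels[(a, b)] = "ortholog_one2many"
        else:
            labels[(a, b)] = "ortholog_many2many"
    return labels

# Hypothetical human/mouse orthologue pairs.
PAIRS = [
    ("H1", "M1"),
    ("H2", "M2a"), ("H2", "M2b"),
    ("H3a", "M3a"), ("H3a", "M3b"), ("H3b", "M3a"), ("H3b", "M3b"),
]
labels = orthology_types(PAIRS)
```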
15 of 56
…in Ensembl…
16 of 56
Orthologue and Paralogue types
17 of 56
GeneView
18 of 56
GeneView
19 of 56
GeneTreeView
Links to ATV and JalView
GeneTree
MUSCLE protein alignment
20 of 56
GeneTreeView
Duplication node (red)
Speciation node (blue)
21 of 56
ATV
22 of 56
Protein clustering into families
• Cluster proteins from different organisms that may share the same function
• Obtain some kind of description for ‘novel’ genes/proteins
• Locate family members over the whole genome
• Identify possible orthologues and paralogues in other species
23 of 56
Protein Dataset
• Nearly a million proteins clustered:
– All Ensembl proteins from all species in Ensembl
• 513,256 predicted proteins
– All metazoan (animal) proteins in UniProt
• 55,892 UniProt/Swiss-Prot
• 469,725 UniProt/TrEMBL
• Blastp all versus all, then clustering with MCL
24 of 56
Clustering Strategy
• BLASTP all-versus-all comparison
• Markov clustering
• For each cluster:
– Calculation of multiple sequence alignments with ClustalW
– Assignment of a consensus description
25 of 56
Markov Clustering (MCL)
• MCL stands for the Markov CLustering algorithm, based on flow simulation in graphs (http://micans.org/mcl/)
• Keeps only very well inter-connected nodes (proteins) in the same cluster
• Allows rapid and accurate detection of protein families on a large scale
• Automatic description and ClustalW multiple alignment applied to each cluster
[Figure: MCL]
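The flow simulation can be sketched as alternating expansion and inflation steps on a column-stochastic matrix. The code below is a toy pure-Python version for illustration, not the mcl program itself; the similarity matrix, the parameter defaults and the way clusters are read off the final matrix are assumptions.

```python
def normalise(matrix):
    """Scale each column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]

def expand(matrix):
    """Expansion: square the matrix so flow spreads along two-step paths."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def inflate(matrix, power):
    """Inflation: raise entries to a power and renormalise, favouring strong flow."""
    return normalise([[x ** power for x in row] for row in matrix])

def mcl(similarity, inflation=2.0, iterations=30):
    """Return clusters as sorted lists of node indices."""
    n = len(similarity)
    # Add self-loops so that every column has a non-zero sum.
    m = normalise([[similarity[i][j] + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical similarity graph: two groups of three proteins, no hits between groups.
SIMILARITY = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
families = mcl(SIMILARITY)
```

A higher inflation value tends to give smaller, tighter clusters; the value 2.0 here is only a placeholder.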
26 of 56
ProtView
Link to FamilyView
27 of 56
FamilyView
Ensembl family members within human
Ensembl family members in other species
JalView multiple alignments
28 of 56
For each cluster
• We store
– Description and score
– Multiple alignment
• Future extensions
– Improving descriptions
– Multiple alignment assessment
– Build phylogeny on each cluster
• Using the multiple alignment
• Using dS values (mainly inside mammals)
• Extend paralogous prediction
29 of 56
Aligning complete genomes
30 of 56
Whole Genome Alignments
• Understand what evolution has done to the species compared, after speciation
– What is missing in one species, present only in another?
– Differences between closely related species may help in understanding speciation
• Define syntenic regions: long regions of DNA sequence where order and orientation are highly conserved
• Conserved non-coding regions
– Guides to putative regulatory regions
31 of 56
Evolution at the DNA level
…ACTGACATGTACCA…
…AC----CATGCACCA…
Sequence edits
• Mutation
• Deletion
Rearrangements
• Inversion
• Translocation
• Duplication
32 of 56
Basic Idea
• Functional sequences evolve more slowly than non-functional sequences
• Comparing genomic sequences from species at different evolutionary distances allows us to identify:
– Coding genes
– Non-coding genes
– Non-coding regulatory sequences
33 of 56
Aligning large genomic sequences
• Independent of protein/gene predictions
• Should find all highly similar regions between two sequences
• Should allow for segments without similarity, rearrangements etc.
• Issues
– Heavy process
– Scalability, as more and more genomes are sequenced
– Time constraint
– Computes run only by a few dedicated groups
– As the «true» alignment is not known, it is difficult to measure alignment accuracy and to choose the right method
34 of 56
Using a local aligner
• Local alignment
– Find all highly similar regions over 2 sequences
• Find the orthologous as well as all the paralogous sequences
– Separated by segments without alignment
– Can handle rearranged sequences
– Needs post-filtering to limit heavily overlapping alignments
35 of 56
Local v Global Alignment
[Figure: local and global alignment plots of AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTTAATC (horizontal) against AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA (vertical)]
Local
• Advantages: compares large genomic regions (requires syntenic maps); can detect rearrangements like translocations, inversions and duplications (!)
• Disadvantages: fails to identify insertions or deletions
Global
• Advantages: detects insertions and deletions
• Disadvantages: fails to detect rearrangements (inversions)
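The difference between the two modes shows up directly in the scoring recurrences. The sketch below uses the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple linear gap cost; these are stand-ins for illustration, not the genome aligners used in Ensembl, and the scoring values and example sequences are arbitrary.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score when both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over any pair of subsequences (never below 0)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

# The shared segment ACGTACGT sits at different ends of the two sequences.
SEQ_A = "GGGGACGTACGT"
SEQ_B = "ACGTACGTCCCC"
local = local_score(SEQ_A, SEQ_B)
glob = global_score(SEQ_A, SEQ_B)
```

The local score picks out the shared segment alone, while the global score also has to pay for the unrelated flanks.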
36 of 56
Glocal Alignment Problem
Find the least-cost transformation of one sequence into another using new operations:
• Sequence edits (indels, mutations)
• Inversions
• Translocations
• Duplications
• A combination of these
GTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGAG
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACT
Glocal aligner (Brudno et al., 2003)
37 of 56
BLASTZ-net, tBLAT and MLAGAN
• BLASTZ-net (comparison at the nucleotide level) is used for species that are evolutionarily close, e.g. human - mouse
• Translated BLAT (comparison at the amino acid level) is used for evolutionarily more distant species, e.g. human - zebrafish
• MLAGAN global alignment is used for multispecies alignments
38 of 56
All versus all approach using BLASTZ (collaboration with UCSC)
• Can handle large sequences
• Used 2-weighted spaced seeding strategy
• Dynamic masking
• Makes distinction between repeat and non-repeat sequences (soft masking)
• Try aligning inside repeats
• One iterative step with lower threshold to expand alignments
39 of 56
Blastz strategy
• 10Mb Human fragments (3000)
• 30Mb Mouse fragments (100)
• Lineage-specific repeats removed
• 48 hours on 1024 CPUs
• Generates 9Gb of output
• When filtered for best hit on Human, reduced to 2.5Gb
40 of 56
Blastz human genome coverage
• 40% of the human genome is covered by an alignment of mouse sequences
• After rescoring the alignments with a “tight” matrix that is very stringent and looks for high conservation (>70% identity), the coverage goes down to 6%
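Coverage figures of this kind can be illustrated with a small interval calculation. The sketch below simply filters blocks on an identity value and merges what remains; it does not rescore with a matrix as the slide describes, and the block coordinates, identities and function name are invented for the example.

```python
def covered_fraction(blocks, genome_length, min_identity=0.0):
    """Fraction of the reference covered by blocks with identity above min_identity.

    blocks: (start, end, identity) tuples with half-open reference coordinates.
    Overlapping blocks are merged so that no base is counted twice.
    """
    kept = sorted((s, e) for s, e, identity in blocks if identity > min_identity)
    covered = 0
    cur_start = cur_end = None
    for start, end in kept:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length

# Hypothetical alignment blocks on a 1000 bp reference.
BLOCKS = [(0, 300, 0.65), (200, 400, 0.90), (600, 660, 0.75), (800, 1000, 0.50)]
all_coverage = covered_fraction(BLOCKS, 1000)
tight_coverage = covered_fraction(BLOCKS, 1000, min_identity=0.70)
```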
41 of 56
DNA/DNA matches web display
ContigView human EPO
Conserved sequences
42 of 56
DotterView
Mouse sequence
Human sequence
43 of 56
Multiple alignments
• Currently 3 sets:
– MLAGAN-primates:
– MLAGAN-amniote vertebrates:
– MLAGAN-eutherian mammals:
44 of 56
Strategy
• Use all coding exons
• Get sets of best reciprocal hits
• Create orthology maps
• Build multiple global alignments
45 of 56
MultiContigView
46 of 56
Multiple alignments
ContigView human EPO
47 of 56
AlignSliceView
Alignment on basepair level
Human
Dog
Rat
Mouse
Export alignments
48 of 56
MultiContigView vs. AlignSliceView
49 of 56
AlignView
50 of 56
GeneSeqalignView
51 of 56
GeneSeqalignView
52 of 56
Syntenic Regions
• Genome alignments are refined into larger syntenic regions
• Alignments are clustered together when the relative distance between them is less than 100 kb and order and orientation are consistent
• Any clusters less than 100 kb are discarded
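The clustering rule above can be sketched as a single greedy pass over alignments sorted along the reference. This is a simplified illustration, not the Ensembl pipeline: the Aln record, the function names and the example coordinates are hypothetical, and only the two 100 kb thresholds come from the slide.

```python
from collections import namedtuple

# Hypothetical alignment record; strand is +1 or -1 relative to the reference.
Aln = namedtuple("Aln", "ref_start ref_end tgt_chr tgt_start tgt_end strand")

MAX_GAP = 100_000    # alignments closer than this are clustered together
MIN_BLOCK = 100_000  # clusters shorter than this are discarded

def can_extend(last, aln):
    """True if aln continues last with consistent order and orientation."""
    if (last.tgt_chr, last.strand) != (aln.tgt_chr, aln.strand):
        return False
    ref_gap = aln.ref_start - last.ref_end
    if aln.strand == 1:
        tgt_gap = aln.tgt_start - last.tgt_end
    else:
        tgt_gap = last.tgt_start - aln.tgt_end
    return 0 <= ref_gap < MAX_GAP and 0 <= tgt_gap < MAX_GAP

def synteny_blocks(alignments):
    """Cluster alignments on one reference chromosome into syntenic blocks."""
    clusters = []
    for aln in sorted(alignments, key=lambda a: a.ref_start):
        if clusters and can_extend(clusters[-1][-1], aln):
            clusters[-1].append(aln)
        else:
            clusters.append([aln])
    blocks = []
    for cluster in clusters:
        block = Aln(cluster[0].ref_start, cluster[-1].ref_end, cluster[0].tgt_chr,
                    min(a.tgt_start for a in cluster),
                    max(a.tgt_end for a in cluster), cluster[0].strand)
        if block.ref_end - block.ref_start >= MIN_BLOCK:
            blocks.append(block)
    return blocks

ALIGNMENTS = [
    Aln(0, 60_000, "4", 1_000_000, 1_060_000, 1),
    Aln(80_000, 150_000, "4", 1_090_000, 1_160_000, 1),
    Aln(400_000, 430_000, "7", 5_000_000, 5_030_000, -1),
    Aln(900_000, 960_000, "4", 2_000_000, 2_060_000, 1),
]
blocks = synteny_blocks(ALIGNMENTS)
```

The first two alignments merge into one 150 kb block; the other two stay isolated, fall under 100 kb and are dropped.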
53 of 56
SyntenyView
Human chromosome
Mouse chromosomes
Mouse chromosomes
Orthologues
54 of 56
CytoView
Syntenic blocks
55 of 56
Outlook
• OrthoView
• Displaying alignments both from whole genome alignments and on orthologues
• Consider all isoforms for each gene
• Calculate dN/dS
56 of 56
Acknowledgements
• Abel Ureta-Vidal
• Benoît Ballester
• Kathryn Beal
• Stephen Fitzgerald
• Javier Herrero
• Albert Vilella
Ensembl team
Sep 2006
57 of 56
Basic idea
[Figure labels: ancestor sequence; speciation event; mutations; selection; alignment. Legend: mutation, regulatory region, exon.]
58 of 56
Global v Local Alignments
Local
• Advantages: compares large genomic regions (uses syntenic maps); can detect rearrangements like translocations, inversions and duplications (!)
• Disadvantages: fails to identify insertions or deletions
Global
• Advantages: detects insertions and deletions
• Disadvantages: fails to detect rearrangements (inversions)
[Figure: segments 1 and 2 shown under an inversion (-) and a duplication]
Glocal aligner (Brudno et al., 2003), pairwise only
59 of 56
Inparalogues vs Outparalogues
Adapted from Sonnhammer & Koonin (2002) TIG 18, 12: 620
60 of 56
Problems: weak orthologies
61 of 56
Problems: misalignments
62 of 56
Possible solutions
• Weak orthologies:
• Poor alignments:
– report to author
– edit alignments, detect wrong edges, redefine blocks
– use another aligner
63 of 56
From Edgar, R. C. (2004) NAR 32:1792-1797