Comparative bacterial genomics

João Carlos Setubal
VBI/Virginia Tech
for EMBO course

Florianopolis, July 2008

Contents

• Tree of Life
• Basic notions of genomics
• Motivation for comparative genomics
• Whole replicon alignment: pairwise and multiple
• Gene-centric comparisons
• Orthology and Synteny
• Exercises

April 21, 2023 JC Setubal 2

Ciccarelli et al, Science, 2006

Williams, Sobral, and Dickerman, JBAC, 2007

proteobacteria

Genomes

• The entire DNA complement of a single cell
• Abstraction
– a string s over the alphabet Σ = {A, C, G, T}
– Example

CTTCCAGTTCAACCGGCCGGTCGTCGCGGACGACGCGGCCGCCGGCGCCGCGATGCTGGCGGACGTACCGCACACCCGCCCCATCTCCATCTTCGCTTC

Genome sizes

• Genomes are measured in
– kb (kilo base pairs), Mb (mega), or Gb (giga)
• Viruses: |s| = [5 – 200] kb
• Bacteria: |s| = [1 – 10] Mb
• Eukaryotes: |s| = [10 Mb – 100 Gb]
• Humans: 3 Gb
• Marbled lungfish: 130 Gb (T. Gregory, www.genomesize.com)

Famous bacteria

• Haemophilus influenzae (1.8 Mb)
– Human pathogen, first genome of a free-living organism to be sequenced (1995)
• Escherichia coli (4.6 Mb)
– Human pathogen and model organism (1997)
• Agrobacterium tumefaciens (6 Mb)
– Plant pathogen and biotechnology tool (2001)

What is a gene

• A small substring of s that contains information

• Bacteria generally have 1 gene every 1 kb
– 5 Mb genome ≈ 5,000 genes
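
As a rough illustration of this rule of thumb, the Python sketch below treats a genome as a string and estimates gene counts from genome size. The function name `estimate_gene_count` and the constant `GENES_PER_KB` are illustrative names, not part of the original slides; the genome sizes are the ones quoted on the "Famous bacteria" slide.

```python
# Rule of thumb from the slide: roughly 1 gene per 1 kb of bacterial genome.
GENES_PER_KB = 1.0  # assumption: an average density, not a measured value


def estimate_gene_count(genome_size_mb):
    """Estimate the number of genes in a bacterial genome of the given size (Mb)."""
    size_kb = genome_size_mb * 1000  # 1 Mb = 1,000 kb
    return round(size_kb * GENES_PER_KB)


# Genome sizes (Mb) as quoted earlier in these slides.
famous_bacteria = {
    "Haemophilus influenzae": 1.8,
    "Escherichia coli": 4.6,
    "Agrobacterium tumefaciens": 6.0,
}

for name, size_mb in famous_bacteria.items():
    print(f"{name}: ~{estimate_gene_count(size_mb)} genes")

# A genome is abstracted as a string s over {A, C, G, T}; |s| is its length.
s = "CTTCCAGTTCAACCGGCCGG"
print(len(s), set(s) <= set("ACGT"))
```

The estimate only captures the order of magnitude; actual gene counts come from annotation.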

>A small section of a genome
AGCTCGCGCTCCGCATCCATCCAGTAGGGTTCGGTGTCGACGAGCGTGCC

GTCCATATCCCAGAAGACGGCGGCCGGCATCGCGTGCGGAGTCAGTTCGG

TCACGGCTGACAAGTCTATCCCGGCGGCCCCGGGCCTATTCTTGAGGGAC

GGCGTCCTGACCGGTCGCCGGATGAAAGGACCAGAACGCCCCGTGACTGA

CGCGAACAGCATCCTCGGAGGGCGCATCCTCGTGGTGGCCTTCGAAGGGT

GGAACGACGCTGGCGAGGCCGCCAGCGGGGCCGTCAAGACGCTCAAGGAC

CAGCTGGATGTCGTCCCGGTCGCCGAGGTCGATCCCGAGCTGTACTTCGA

CTTCCAGTTCAACCGGCCGGTCGTCGCGGACGACGACGGCCGCCGGCGCC

TCATCTGGCCGTCCGCGGAGATCCTGGGCCCAGCTCGCCCCGGCGACACC

GGCGATGCGCGCCTGGACGCCACCGGCGCCAACGCGGGCAATATCTTCCT

TCTCCTCGGCACCGAGCCGTCGCGCAGCTGGCGCAGCTTCACCGCGGAGA

TCATGGATGCGGCCCTGGCCTCCGACATCGGCGCCATCGTCTTCCTCGGT

GCGATGCTGGCGGACGTACCGCACACCCGCCCCATCTCCATCTTCGCTTC

GAGCGAGAACGCGGCCGTCCGTGCGGAGCTCGGCATCGAACGCTCTTCGT

ACGAGGGGCCGGTCGGTATCCTGAGCGCGCTCGCCGAAGGGGCGGAGGAC

GTGGGCATTCCGACCATCTCCATCTGGGCGTCGGTTCCGCACTATGTCCA

CAATGCGCCCAGCCCGAAGGCGGTGCTCGCACTGATCGACAAGCTCGAAG

AGCTGGTGAATGTCACCATCCCGCGTGGCTCGCTGGTGGAGGAGGCCACG

GCCTGGGAAGCCGGGATCGACGCGCTGGCTCTGGACGACGACGAGATGGC

TACGTACATCCAGCAGCTGGAGCAGGCACGCGACACCGTGGACTCCCCTG

AGGCCAGCGGCGAGGCGATCGCCCAGGAGTTCGAGCGCTACCTCCGCCGC

CGCGACGGCCGCGCCGGCGATGACCCCCGCCGTGGCTGACGTCACCCCCT

CTCTGCGTCCGCCGTCCTCTGTTCCCCCCGCTCGGCCTCCCCTGAGGCCG

AGGAGTCGCGCCCACATGCCGGAAACTCCTCCTTTCCTGACTTTCTGGAG

A bacterial gene

“Central Dogma” of molecular biology

• gene (DNA) → messenger (RNA) → protein (amino acids)
– DNA → RNA: transcription
– RNA → protein: translation

Proteins are 3D objects made out of a linear sequence of amino acids
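
The two steps of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a gene given as its coding strand; the function names `transcribe` and `translate` and the example gene are made up for this sketch.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA (T is replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein (one letter per amino acid), stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCCGAATTCTAA"  # hypothetical toy gene: ATG GCC GAA TTC TAA
mrna = transcribe(gene)
print(mrna)             # AUGGCCGAAUUCUAA
print(translate(mrna))  # MAEF
```

The sketch ignores everything that makes real gene expression interesting (promoters, the template strand, alternative start codons); it only shows the string-to-string mapping.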

A protein

www.berkeley.edu/.../ images/ras-rid-protein.gif

Molecular Plant-Microbe Interactions

Sugar cane pathogen

Ratoon stunting disease

Monteiro-Vitorello et al 2004

Comparative genomics

• There are currently more than 300 completely sequenced microbial genomes publicly available
• Many are of closely related species
• In a few years there will be thousands
• Why compare?
• How to do it?

Why comparative genomics?

• To understand the genomic basis of the present
– Differences in lifestyle
  • pathogen vs. nonpathogen
  • obligate vs. free-living
– Host specificity
  • animals vs. plants, plant X vs. plant Y, etc.
– In the case of pathogens: this understanding should help us in fighting disease
• To understand the past
– How organisms evolved to be what they are

Citrus canker: Xanthomonas axonopodis pathovar citri

Black rot: Xanthomonas campestris pathovar campestris

What is comparative genomics

• Assuming input is the sequence and its annotation
• There are many ways that genomes can be compared
– Different resolutions
• Whole genome
– Genome alignments
– Synteny (gene order conservation)
– Anomalous regions
• Gene-centric
– Gene families and unique genes
– Gene clustering by function
• Gene sequence variations
– Codon usage, SNPs, inDels, pseudogenes

Resolution

• Low resolution
– Scope: entire genomes
– Example event: rearrangement
• High resolution
– Scope: nucleotide sequences
– Example event: single mutation

Genome-wide evolutionary events

• Replicon rearrangements
• Gene/region duplication
• Gene/region loss
• Chromosome ↔ plasmid DNA exchange
• Lateral transfer

Copyright ©2004 by the National Academy of Sciences

Boussau, Bastien et al. (2004) Proc. Natl. Acad. Sci. USA 101, 9722-9727

Fig. 4. Net gene loss or gain throughout the evolution of the α-proteobacterial species

Example of a “multipartite genome”

Agrobacterium tumefaciens C58

Replicon structure in all completely sequenced Rhizobiaceae plus M. loti

Replicon | c58 | s4 | k84 | Retli | Rleg | Sm | Ml
1 | 2.84 | 3.73 | 4.00 | 4.38 | 5.06 | 3.65 | 7.04
2 | 2.07 | 1.28 | 2.65 | 0.64 | 0.87 | 1.68 | 0.35
3 | 0.54 | 0.63 | 0.39 | 0.51 | 0.68 | 1.35 | 0.21
4 | 0.21 | 0.26 | 0.18 | 0.37 | 0.49 | - | -
5 | - | 0.21 | 0.04 | 0.25 | 0.35 | - | -
6 | - | 0.13 | - | 0.19 | 0.15 | - | -
7 | - | 0.08 | - | 0.18 | 0.15 | - | -

Numbers are replicon size in Mbp (rows: replicons; columns: genomes)

Whole replicon alignments: the pairwise case

If the sequences were identical we would see a single unbroken diagonal

[Figure: dot plot of sequence A against sequence B]

An inversion

[Figure: dot plot with blocks A B C D along one axis and A, C B, D along the other]

[Figure: dot plot with blocks A B C D along one axis and A, C, D, B along the other]

Such inversions seem to happen around the origin or terminus of replication

Eisen JA, Heidelberg JF, White O, Salzberg SL. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol. 2000;1(6):RESEARCH0011
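
To make the dot-plot picture concrete, here is a small Python sketch that compares a toy replicon with a copy in which an internal segment has been inverted. It uses plain k-mer lookup rather than any real alignment tool, and the sequences are random, hypothetical data; `dot_plot_matches` is a name invented for this example. Direct matches fall on the main diagonal, and matches in the inverted segment fall on an anti-diagonal.

```python
import random


def reverse_complement(seq):
    """Reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def dot_plot_matches(a, b, k):
    """Return (pos_in_a, pos_in_b, strand) for every k-mer of b found in a.

    strand is "+" for a direct match and "-" when the reverse complement matches.
    """
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)
    matches = []
    for j in range(len(b) - k + 1):
        word = b[j:j + k]
        matches += [(i, j, "+") for i in index.get(word, [])]
        matches += [(i, j, "-") for i in index.get(reverse_complement(word), [])]
    return matches


# Toy replicon made of four random blocks A, B, C, D (hypothetical data).
random.seed(1)
A, B, C, D = ("".join(random.choices("ACGT", k=200)) for _ in range(4))
original = A + B + C + D
inverted = A + reverse_complement(B + C) + D  # B and C inverted as one segment

k = 20
matches = dot_plot_matches(original, inverted, k)
direct = [(i, j) for i, j, strand in matches if strand == "+"]
reverse = [(i, j) for i, j, strand in matches if strand == "-"]
print(len(direct), "direct matches (main diagonal, i == j)")
print(len(reverse), "reverse matches (anti-diagonal, i + j constant)")
```

Plotting `direct` in one color and `reverse` in another would give the characteristic pattern of an inversion: a diagonal interrupted by a segment running the other way.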

Replicon sequence comparisons

• Basic tool: MUMmer
– Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002 Jun 1;30(11):2478-83.
– Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12
• http://mummer.sourceforge.net

Promer alignment: Xanthomonas axonopodis pv citri vs. E. coli K12

Both are proteobacteria!
Red: direct; green: reverse

Basics of MUMmer

• It finds Maximal Unique Matches (MUMs)
• These are exact matches longer than a user-specified length threshold that are unique
• Exact matches found are clustered and extended (using dynamic programming)
– Result is approximate matches
• Data structure for exact match finding: suffix tree
– Difficult to build but very fast
• Nucmer and promer
– Both very fast
– O(n + #MUMs), n = genome lengths
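
The idea of a maximal unique match can be illustrated with a naive Python sketch. This is not how MUMmer works internally (MUMmer uses a suffix tree); the sketch swaps in a simpler technique: it seeds on k-mers that occur exactly once in each sequence and extends each seed in both directions while the characters agree. It therefore reports only those maximal matches that contain at least one such unique k-mer, which is a simplification. All names and the toy sequences are hypothetical.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {
        seq[i:i + k]: i
        for i in range(len(seq) - k + 1)
        if counts[seq[i:i + k]] == 1
    }


def seed_and_extend_unique_matches(a, b, k):
    """Return sorted (start_a, start_b, length) for maximal exact matches
    that contain at least one k-mer unique in both a and b."""
    unique_a, unique_b = unique_kmers(a, k), unique_kmers(b, k)
    found = set()
    for word, i in unique_a.items():
        j = unique_b.get(word)
        if j is None:
            continue
        # Extend the seed to the left while the preceding characters agree.
        start_a, start_b = i, j
        while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
            start_a -= 1
            start_b -= 1
        # Extend the seed to the right while the following characters agree.
        end_a, end_b = i + k, j + k
        while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
            end_a += 1
            end_b += 1
        found.add((start_a, start_b, end_a - start_a))
    return sorted(found)


# Two toy sequences sharing the 10-base core CGTACGGATC (hypothetical data).
a = "AAAA" + "CGTACGGATC" + "TTTT"
b = "GG" + "CGTACGGATC" + "AGAG"
print(seed_and_extend_unique_matches(a, b, k=6))  # [(4, 2, 10)]
```

Several seeds inside the same shared region all extend to the same maximal match, which is why the results are collected in a set.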

sample nucmer output (coords file)

/home/setubal/agro/comp/mummer/../../rhizogenes/v1/ctgs.fasta /home/setubal/agro/comp/mummer/../../vitis/v3/all.fasta
NUCMER

[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=====================================================================================
73024 73193 | 242351 242181 | 170 171 | 93.60 | Contig789 Contig608
220 6244 | 38759 32766 | 6025 5994 | 86.64 | Contig791 Contig604
2798 6297 | 174039 177532 | 3500 3494 | 83.31 | Contig791 Contig606
3828 6297 | 124183 126645 | 2470 2463 | 81.80 | Contig791 Contig606
4767 5392 | 551684 551059 | 626 626 | 82.11 | Contig791 Contig607
8214 8453 | 30747 30508 | 240 240 | 84.65 | Contig791 Contig604
15408 15987 | 181050 181624 | 580 575 | 86.23 | Contig791 Contig606
63864 74254 | 191954 181567 | 10391 10388 | 89.08 | Contig791 Contig604
77203 79534 | 178882 176555 | 2332 2328 | 84.35 | Contig791 Contig604
157451 158456 | 139804 140812 | 1006 1009 | 82.09 | Contig791 Contig606
157483 157800 | 58429 58110 | 318 320 | 89.13 | Contig791 Contig604
163575 166223 | 62781 60133 | 2649 2649 | 78.80 | Contig791 Contig605
166754 168442 | 49403 47716 | 1689 1688 | 85.79 | Contig791 Contig604
171247 173701 | 45005 42556 | 2455 2450 | 88.17 | Contig791 Contig604
171261 172115 | 157617 158476 | 855 860 | 86.30 | Contig791 Contig606
181828 184458 | 41748 39140 | 2631 2609 | 93.13 | Contig791 Contig604
184829 185852 | 38838 37821 | 1024 1018 | 91.61 | Contig791 Contig604
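
A coords listing like the one above is easy to post-process. The Python sketch below parses data rows of that shape, assuming the pipe-separated column layout shown in the sample; `CoordsRow` and `parse_coords_line` are hypothetical names, and lines that do not look like data rows (paths, header, separator) are simply skipped.

```python
from typing import NamedTuple, Optional


class CoordsRow(NamedTuple):
    s1: int
    e1: int
    s2: int
    e2: int
    len1: int
    len2: int
    identity: float
    tag1: str
    tag2: str

    @property
    def is_reverse(self) -> bool:
        # In the sample, matches on the reverse strand are listed with S2 > E2.
        return self.s2 > self.e2


def parse_coords_line(line: str) -> Optional[CoordsRow]:
    """Parse one data row; return None for header, separator, or path lines."""
    parts = [field.split() for field in line.split("|")]
    if [len(p) for p in parts] != [2, 2, 2, 1, 2]:
        return None
    try:
        (s1, e1), (s2, e2), (len1, len2), (idy,), (tag1, tag2) = parts
        return CoordsRow(int(s1), int(e1), int(s2), int(e2),
                         int(len1), int(len2), float(idy), tag1, tag2)
    except ValueError:
        return None


sample = """\
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
73024 73193 | 242351 242181 | 170 171 | 93.60 | Contig789 Contig608
220 6244 | 38759 32766 | 6025 5994 | 86.64 | Contig791 Contig604
2798 6297 | 174039 177532 | 3500 3494 | 83.31 | Contig791 Contig606
"""

rows = [row for row in map(parse_coords_line, sample.splitlines()) if row]
for row in rows:
    strand = "reverse" if row.is_reverse else "direct"
    print(row.tag1, row.tag2, row.len1, f"{row.identity:.2f}%", strand)
```

From here it is a short step to filtering by identity or length, or to feeding the coordinates into a dot plot.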

A suffix tree for BANANAS

www.somethinkodd.com/.../2006/01/suffixtree.png
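
Building a real suffix tree takes some care, so the sketch below uses a closely related but simpler structure, a suffix array (the sorted list of suffix start positions), to show why indexing all suffixes makes exact-match lookup fast. This is a swapped-in technique for illustration, not the data structure MUMmer uses, and the construction here is the naive one; the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern in text, using binary search."""
    m = len(pattern)
    # Prefixes (length m) of the sorted suffixes are themselves in sorted order.
    prefixes = [text[i:i + m] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(suffix_array[lo:hi])


text = "BANANAS"
sa = build_suffix_array(text)
print(sa)                                 # [1, 3, 5, 0, 2, 4, 6]
print(find_occurrences(text, sa, "ANA"))  # [1, 3]
print(find_occurrences(text, sa, "NA"))   # [2, 4]
```

All occurrences of a pattern sit next to each other in the sorted order, which is the same property a suffix tree exposes as a single subtree.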

Proteome alignment done with LCS (top: Xcc; bottom: Xac)

Blue: BBHs (bidirectional best hits) that are in the LCS; dark blue: BBHs not in the LCS; red: Xac specifics; yellow: Xcc specifics

Whole replicon multiple alignment

• The program MAUVE
• Darling AC, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004 Jul;14(7):1394-403.

Chromosome alignment (MAUVE)

[Figure: MAUVE alignment of the RSA 493, RSA 331, and Dugway chromosomes]

Genome alignments (MAUVE)

How MAUVE works

• Seed-and-extend hashing
• Seeds/anchors: Maximal Multiple Unique Matches of minimum length k
• Result: Local collinear blocks (LCBs)
• O(G²n + Gn log Gn), G = # genomes, n = average genome length

Alignment algorithm

1. Find multi-MUMs
2. Use the multi-MUMs to calculate a phylogenetic guide tree
3. Find LCBs (subset of multi-MUMs; filter out spurious matches; requires minimum weight)
4. Recursive anchoring to identify additional anchors (extension of LCBs)
5. Progressive alignment (CLUSTALW) using guide tree

Gene-centric comparisons

• Homologs: genes that have the same ancestor; in general retain the same function

• Orthologs: homologs from different species (arise from speciation)

• Paralogs: homologs from the same species (arise from duplication)
– Duplication before speciation (ancient duplication)
  • Out-paralogs; may not have the same function
– Duplication after speciation (recent duplication)
  • In-paralogs; likely to have the same function

Orthologs

speciation

Out-paralogs

In-paralogs

Published April 16, 2008

10 genomes
Orthology + Phylogeny

AG: ancestral (bellii [2], canadensis)
TG: typhus (prowazekii, typhi)
TRG: transitional (akari, felis)
SFG: spotted fever (rickettsii, conorii, sibirica)

How to find orthologs

• Desired features of ortholog clustering
– Ability to distinguish between in- and out-paralogs
  • In-paralogs should be clustered with their orthologs
– Ability to cluster genes that have the same domain architecture, rather than simply sharing just one domain
• Methods
– Phylogenetic trees
– BLAST
– MCL
– orthoMCL
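
One simple BLAST-based heuristic is the bidirectional best hit (BBH) criterion mentioned earlier in the proteome alignment figure: gene a and gene b are paired when each is the other's best-scoring hit. The Python sketch below is a minimal illustration, assuming the BLAST results have already been reduced to (query, subject) → score dictionaries; the gene names and scores are invented.

```python
def best_hits(scores):
    """scores maps (query, subject) -> similarity score.
    Return a dict mapping each query to its best-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for genomes A (a1..a3) and B (b1..b3).
a_vs_b = {("a1", "b1"): 500, ("a1", "b2"): 120,
          ("a2", "b2"): 300, ("a3", "b2"): 250}
b_vs_a = {("b1", "a1"): 510, ("b2", "a2"): 305,
          ("b2", "a3"): 240, ("b3", "a1"): 90}

print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a2', 'b2')]
```

In this toy example a3 is left unpaired even though it hits b2 well. If a3 were an in-paralog of a2, BBH would miss it, which is one reason clustering approaches are preferred when in-paralogs should be grouped with their orthologs.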

OrthoMCL

• Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003 Sep;13(9):2178-89

• Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002 Apr 1;30(7):1575-84

OrthoMCL

1. BLAST all-against-all
2. Weighting scheme
3. MCL algorithm
• Nota bene: orthoMCL is not perfect!
– Two or more families may be wrongly joined
– One family may be wrongly split

Li Li et al. Genome Res. 2003; 13: 2178-2189

orthoMCL pipeline

Li Li et al. Genome Res. 2003; 13: 2178-2189

OrthoMCL weighting scheme for similarity graph

(Tribe)MCL

• Enright, Van Dongen, Ouzounis [2002]
• Adaptation of MCL clustering algorithm of Van Dongen
• Markov cluster
• Simulates random walks in the graph
• Expands and inflates certain matrices until equilibrium is reached
• Expansion: matrix squaring
• Inflation: raise each entry of the expanded matrix to a power, then rescale the columns so that the matrix is stochastic again
• Has been reasonably validated
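
The expansion/inflation loop can be sketched in pure Python. This is a toy version for small graphs, assuming a symmetric weight matrix, added self-loops, inflation parameter 2, and a fixed number of iterations instead of a convergence test; it is not the TribeMCL or orthoMCL implementation, and all names and the example graph are hypothetical.

```python
def normalize_columns(matrix):
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            matrix[i][j] /= total
    return matrix


def expand(matrix):
    """Expansion: square the matrix (two steps of the random walk)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, power):
    """Inflation: entrywise power, then rescale columns to be stochastic."""
    return normalize_columns([[x ** power for x in row] for row in matrix])


def mcl(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Return clusters (sorted lists of node indices) for a weighted graph."""
    n = len(weights)
    # Add self-loops, a common MCL preprocessing step.
    m = [[float(weights[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Nodes whose columns end up on the same attractor rows share a cluster.
    clusters = {}
    for j in range(n):
        attractors = frozenset(i for i in range(n) if m[i][j] > eps)
        clusters.setdefault(attractors, []).append(j)
    return sorted(clusters.values())


# Two triangles (0-1-2 and 3-4-5) joined by one weak edge between 2 and 3.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
         (2, 3): 0.2}
weights = [[0.0] * 6 for _ in range(6)]
for (u, v), w in edges.items():
    weights[u][v] = weights[v][u] = w

print(mcl(weights))
```

In orthoMCL the nodes would be proteins and the edge weights would come from the BLAST-based weighting scheme; a higher inflation value tends to give finer clusters.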

Gene Set Computations

• Given a set of genomes, represented by their ‘proteomes’ or sets of protein sequences
• Given homologous relationships (as given for example by orthoMCL)
– Which genes are shared by genomes X and Y?
– Which genes are unique to genome Z?
– Venn or extended Venn diagrams
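
Once genes have been grouped into families, these questions reduce to set operations. The Python sketch below assumes ortholog groups are available as a mapping from group identifier to the set of genomes that have at least one member; the group identifiers and memberships are invented for illustration.

```python
from collections import Counter

# Hypothetical ortholog groups: group id -> genomes with at least one member.
groups = {
    "OG1": {"X", "Y", "Z"},
    "OG2": {"X", "Y"},
    "OG3": {"Z"},
    "OG4": {"X"},
    "OG5": {"Y", "Z"},
}


def groups_in(groups, genome):
    """Set of group ids that have a member in the given genome."""
    return {gid for gid, genomes in groups.items() if genome in genomes}


def venn_counts(groups):
    """Count groups per exact genome combination (the regions of a Venn diagram)."""
    return Counter(frozenset(genomes) for genomes in groups.values())


shared_xy = groups_in(groups, "X") & groups_in(groups, "Y")
unique_z = groups_in(groups, "Z") - groups_in(groups, "X") - groups_in(groups, "Y")

print("shared by X and Y:", sorted(shared_xy))  # ['OG1', 'OG2']
print("unique to Z:", sorted(unique_z))         # ['OG3']
for region, count in venn_counts(groups).items():
    print(sorted(region), count)
```

Counting groups rather than individual genes sidesteps the question of how to count in-paralogs, which is a design choice worth stating whenever such numbers are reported.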

3-way genome comparison

[Figure: Venn diagram of genomes A, B, and C]

Brucella gene set computations

Joining synteny and homology

OAK: ortholog alignment for prokaryotes

[Pipeline diagram. Labels: Genome 1, Genome 2, Genome n; ortholog set builder (orthoMCL); Script 1; Script 2; HTML tables; graph; report; annotators]

Replicon color key for HTML tables

R/G | S4 | C58 | K84 | R. etli | R. leguminosarum | S. meliloti | M. loti MAFF | M. loti BNC
1 | 2nd chromosome | linear chromosome | 2nd chromosome | - | - | - | - | -
2 | plasmid 630kb | AT plasmid | plasmid 390kb | plasmid F 640kb | plasmid pRL12 870kb | plasmid pSymA | plasmid 1 | plasmid 1
3 | plasmid 259kb | Ti plasmid | plasmid 179kb | plasmid E 510kb | plasmid pRL11 680kb | plasmid pSymB | plasmid 2 | plasmid 2
4 | plasmid 210kb | - | plasmid 44kb | plasmid D 370kb | plasmid pRL10 490kb | - | - | plasmid 3
5 | plasmid 130kb | - | - | plasmid C 250kb | plasmid pRL9 350kb | - | - | -
6 | plasmid 79kb | - | - | plasmid A 190kb | plasmid pRL7 150kb | - | - | -
7 | - | - | - | plasmid B 180kb | plasmid pRL8 150kb | - | - | -

What do the tables show

• conserved blocks (aka “microsyntenic regions”), and how these blocks appear in different replicons across the genomes compared

• some of these blocks are not operons (would need to show strand)

• possible block losses

Polymorphism detection

• inDels, SNPs
• pseudogenes

[Figure 4, panels I and II]

Pseudogenes

• Nonfunctional protein coding genes
• Mutations introduce “sequence problems” (frameshifts, stop in frame, absence of stop)
• Natural mutation or sequencing error?
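
Two of these "sequence problems" can be checked directly on a candidate coding sequence, as in the Python sketch below. It is a simplified illustration: a length that is not a multiple of 3 is used only as a crude hint of a frameshift (in practice frameshifts are found by comparison with an intact homolog), the three standard stop codons are assumed, and the function name and example sequences are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def sequence_problems(cds):
    """Return a list of 'sequence problems' found in a candidate coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing 1 or 2 bases.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("stop codon in frame")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("absence of stop codon")
    return problems


examples = {
    "intact": "ATGGCCGAATTCTAA",
    "in-frame stop": "ATGGCCTAATTCTAA",
    "no stop": "ATGGCCGAATTC",
    "one base deleted": "ATGGCGAATTCTAA",
}
for label, cds in examples.items():
    print(label, "->", sequence_problems(cds) or "no problems found")
```

A check like this cannot tell a natural mutation from a sequencing error; that question needs the underlying read data or resequencing.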

Pseudogene cases

Why study pseudogenes?

• “Normal” bacterial genomes have 1-5% pseudogenes [Liu et al]
• Pseudogenes can give interesting clues to evolutionary pathways

Why study pseudogenes? Cont’d

• High fractions of pseudogenes suggest a “genome degradation” process

• May be cause or effect of niche restriction
• Examples
– Mycobacterium leprae: 36% (~1,100 genes)
– Leifsonia xyli subsp. xyli: 13% (~300 genes)
• Pseudogenes do not show up in protein-level BLAST searches, since they are not part of the annotated proteome
– Ortholog computations will in general not include them!

Brucella Pseudogene Analysis

Pseudogene Identification by Sequence Similarity: Study of 8 Brucella Genomes

• BLASTN: annotated pseudogenes vs. genome sequences
• Hit categories: previously known pseudogene; known gene (homologous to pseudogene); newly identified pseudogene

Identification of New Pseudogenes by Homology

[Bar chart, vertical axis 0 to 600. Genomes: Bab9941, BabS19, Bcan23365, Bmel16M, Bab2308, Bovi25840, Bsui1330, Bsui23445. Series: PG Count: Initial; Tot. Alignments; Known Genes; PG Count: Final]

Total alignments | 4120 | 0.98
Gene hits | 2627 | 0.62
Pseudogenes | 1493 | 0.35

Genomics is just the beginning

[Figure: levels of increasing complexity: genomics/proteomics, interactions between molecules, cell processes, whole organisms, populations]

21st century Biology: integration

Acks

• Nalvo Almeida
• Chris Lasher
• Brett Tyler
• Rebecca Wattam