
Phylogenomics

Phylogenetics vs. phylogenomics

reconstruction of phyletic relationships based on the analysis of:

- phylogenetics: several (to several dozen) genes

- phylogenomics: complete genetic information (ideal), or several dozen to hundreds of coding sequences (phylotranscriptomics)

a vast amount of genetic information should significantly improve the prediction of phylogenetic relationships and eliminate signal noise

... and sometimes it really works

Adl et al, 2012

... but sometimes it doesn’t

possible source of error: - incorrect sequence annotation

possible source of error: - paralogues

possible source of error: - sins of the past

“LEUCA”

...ANIMALS/FUNGI PLANTS RHODOPHYTES

18(16)S rRNA

+ combination of variable and conserved regions
+ zero L/HGT
+ exhaustive taxon sampling
+ known secondary structure
+ hundreds of copies per cell (single-cell PCR)
+ cost per nt + speed
+ ‘18S is always right’

- ~1800 bp
- intraindividual paralogues
- lower branching support

MULTI-PROTEIN DATASETS

+ large amount of information
+ modular
+ robust branching support (although often false)

- limited sampling
- variable quality of phylogenetic signal
- L/H(E)GT
- still costly and slow
- HW-demanding analysis
- stability of topologies (or lack thereof)

DATABASE

PURIFIED DATABASE

HOMOLOGUES

DATASETS

CONCATENATION

- lots of redundancy in dbs (duplicates, close paralogues...)
- usually it is better to get rid of them

sequence clustering

+ speed, relatively HW-friendly, accuracy
- accuracy, black-box

CD-HIT

USEARCH
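
As a rough illustration of what "getting rid of redundancy" means, here is a minimal sketch that drops sequences identical to, or fully contained in, a longer sequence. It is not how CD-HIT or USEARCH work internally (they cluster by an identity threshold and are far faster); the function name `remove_redundant` is hypothetical.

```python
def remove_redundant(records):
    """Drop sequences identical to, or fully contained in, a longer kept sequence.

    records: list of (header, sequence) tuples.
    Returns the non-redundant records, longest first.
    """
    kept = []
    # Process longest sequences first, so a fragment is always compared
    # against the longer sequence that may contain it.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(seq in kept_seq for _, kept_seq in kept):
            kept.append((header, seq))
    return kept
```

For real databases with thousands of sequences, a dedicated clustering tool remains the practical choice; this only shows the idea on a toy scale.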

DB editing

FASTA: universal and simple, but header formats are not unified

NCBI:
>gi|269120277|ref|YP_003308454.1| carbamate kinase [Sebaldella termitidis ATCC 33386]
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAMSQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVVASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINYGKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK

JGI:
>jgi|Dappu1|290510|JCO_fgenesh1_kg.C_scaffold_4000019
MKLVYTVASAFLVVLIAQSAYASEKLSAQDYAYNSTCLNHLRSHIKRELQAAVTYLAMGAWANHYSVQRPGLANFFFDSASEEREHGLKLLGYLRMRGHNDLDILPSSLEPLNGKYEWENSLSALRQALKMEKDVTESIKKIIDYCADAEDHQLADYLTGDFMEEQLKGQRNVAGLANTLQGVLRKQPRLGEWIFDNNLSKSMAV

manual editing works for several sequences, but for several thousand?

GBs of RAM

robust OS and text editor

!Regular expressions!

>gi|269120277|ref|YP_003308454.1| carbamate kinase [Sebaldella termitidis ATCC 33386]
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAMSQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVVASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINYGKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK

Find: >\w+\|\d+\|\w+\|(\w+).*\[(\w+\s\w+).*

Replace: >\2_\1

>Sebaldella termitidis_YP_003308454
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAMSQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVVASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINYGKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK

extremely powerful, easy to learn, fun to use:

!Regular expressions!
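
The same find/replace can be scripted instead of run in a text editor. The sketch below applies the pattern from the slide with Python's `re` module; the function name `rename_header` is hypothetical, and the pattern is assumed to target only the old NCBI `gi|...` header style shown above.

```python
import re

# The find pattern from the slide: group 1 = accession (without version),
# group 2 = the first two words inside the square brackets (genus and species).
NCBI_HEADER = re.compile(r">\w+\|\d+\|\w+\|(\w+).*\[(\w+\s\w+).*")


def rename_header(line):
    """Rewrite an NCBI-style FASTA header as >Genus species_ACCESSION.

    Lines that do not match (sequence lines, other header styles)
    are returned unchanged.
    """
    return NCBI_HEADER.sub(r">\2_\1", line)
```

Applied line by line to a FASTA file, this leaves the sequence lines and differently formatted headers (such as the JGI one) untouched, so each database still needs its own pattern.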

BLAST vs annotation

BLAST (plus relatives) is the only reliable way to identify homologues, do not rely on annotation!

the more the better

beware of close paralogues! Meticulous analysis of single-gene phylogenies (SGP) is necessary
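
To collect candidate homologues from a BLAST search, the tabular output can be parsed directly. The sketch below assumes the standard 12-column tabular format (query, subject, % identity, length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the function name and the e-value cutoff of 1e-10 are illustrative assumptions, not recommendations from the slides.

```python
def candidate_homologues(blast_tab_lines, max_evalue=1e-10):
    """Group BLAST tabular hits by query, keeping those under an e-value cutoff.

    Returns {query: [subject, ...]} with subjects sorted by descending bit score.
    All passing hits are kept (not just the best one), because close paralogues
    can only be told apart later, in single-gene trees.
    """
    hits = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            hits.setdefault(query, []).append((bitscore, subject))
    return {
        query: [subject for _, subject in sorted(scored, key=lambda h: -h[0])]
        for query, scored in hits.items()
    }
```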

possible source of error: - paralogues

commercial vs. free

- both (shiny GUI and command-line scripts) will get you there relatively quickly and easily, but... beware of possible errors; there is no universal solution

Multiple alignment

- an important and necessary step in the identification and definition of DNA or protein domains, oligonucleotide design, phylogenetic analyses...

- most modern algorithms are iterative (they can improve the alignment over successive iterations) and work reasonably well (really, don’t use Clustal unless you have to); some of the most used are MAFFT, MUSCLE, Kalign and ProbCons (none of them is miraculous and each makes mistakes, but it’s not that bad)

- all of the above are accessible online (follow the hyperlinks) or can be run locally... nevertheless, you’ll need an alignment viewer/editor to visualize the results

- several free options (depending on which OS you use):

MS Windows: BioEdit: ‘the living legend’, extensive features, user-friendly, can import from GenBank, align (also translation alignment, although with caveats), edit, annotate, translate, do phylogeny...

Mac: MacClade: great editing features and then some, user-friendly, but it neither aligns nor infers phylogenies; currently works only up to OS X 10.6, not on (Mountain) Lion

Multi-platform:
MEGA: good for alignment, phylogenetic and molecular evolution analyses
Jalview: excellent for proteomics, passable alignment editor
SeaView: great aligner/editor (although it takes time to get used to), excellent features for phylogenetics (inclusion sets, translation alignment), but there’s no UNDO button!

... and then again, if you have access to or can afford Geneious (student licenses are cheap), you can skip everything listed above

Editing

- remember: the tree is only as good as the alignment; crap-in-crap-out!
- the goal is to keep only unambiguously aligned regions and relevant OTUs (remove duplicates or long-branchers)

site selection: AUTOMATED vs. MANUAL

automated: good as a starting point, reproducible, ‘objective’, transparent, but... crude

manual: subjective, often non-reproducible, needs ‘expertise’, but... better (usually), can be fine-tuned to each respective dataset

Example: SeaView

open the dataset (in this case apicomplexa_ssu1.fas) and align it... you already know how, right?

Some regions are conserved (i.e., they show little divergence), so there’s little doubt about the correctness of the alignment. They should be kept for analysis as they carry vital information.

On the other hand, some regions are quite variable and could be aligned in several ways. Because we cannot be sure the information they contain is correct, we should exclude them prior to analysis so as not to introduce error (remember, crap-in-crap-out).

In some situations, especially when you’re new to the problem, it is not so clear which parts of the alignment should be kept and which excluded from analysis. Gblocks (or similar SW) can help you. Luckily, it is also implemented in SeaView; as it tends to remove too much, let’s keep the parameters at their least strict.

regions marked with X are kept, those marked with dashes are excluded from the selection

you can edit the selection afterwards and save it using Files-Save selection
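
To make the X/dash selection line concrete, here is a deliberately crude automated filter that keeps a column only if few sequences have a gap there. This is not the Gblocks algorithm (which also looks at conservation and block length); the function names and the 50% gap threshold are assumptions for illustration.

```python
def column_mask(alignment, max_gap_fraction=0.5):
    """Return a selection line: 'X' for columns to keep, '-' for columns to exclude.

    alignment: list of equal-length aligned sequences (strings).
    """
    n_seqs = len(alignment)
    mask = []
    for column in zip(*alignment):  # iterate over alignment columns
        gap_fraction = column.count("-") / n_seqs
        mask.append("X" if gap_fraction <= max_gap_fraction else "-")
    return "".join(mask)


def apply_mask(alignment, mask):
    """Keep only the columns marked 'X' in the selection line."""
    return ["".join(c for c, m in zip(seq, mask) if m == "X") for seq in alignment]
```

A mask produced this way is only a starting point; as the slides say, manual fine-tuning of the selection is usually better.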

you can also directly perform phylogenetic analysis by clicking on Trees

you can choose from three different methods, PhyML represents Maximum likelihood

the default settings are a reasonable compromise between speed and precision, so you can leave them on

for publication, you will also have to assess branching support

and you may want to use a more thorough tree-search algorithm (check ‘Best of NNI and SPR’)

then hit Run ... and wait ... the time depends on the method and the size of the dataset (obviously, the bigger the longer).

OUTGROUP (root)

branch scale bar (substitutions per site – the longer the branch, the more divergent the sequence)

node (represents hypothetical ancestor of all taxa/branches stemming off the node, also defines clade)

clade (group of sequences sharing common ancestor/stemming from single node)

sister taxa (two taxa forming a clade)

sister clades
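
The terms above can be made concrete with a tree in Newick format, where every pair of parentheses is a node and the leaves inside it form a clade. The following is a minimal sketch of a parser that lists the clades of a simple Newick string; the function name `clades` is hypothetical, and quoted labels or comments in the tree are not handled.

```python
def clades(newick):
    """Return the leaf set of every node (clade) in a simple Newick tree.

    Clades are listed in the order their closing parenthesis appears.
    Branch lengths (':0.1') and node labels after ')' are ignored.
    """
    found = []
    stack = []           # one list of leaf names per currently open '('
    token = ""
    after_close = False  # True while reading the label that follows a ')'
    for ch in newick:
        if ch == "(":
            stack.append([])
            after_close = False
        elif ch in ",);":
            name = token.split(":")[0].strip()
            token = ""
            if name and not after_close and stack:
                stack[-1].append(name)  # a leaf (taxon) name
            if ch == ")":
                leaves = stack.pop()
                found.append(frozenset(leaves))
                if stack:
                    stack[-1].extend(leaves)  # leaves also belong to the parent node
                after_close = True
            else:
                after_close = False
        else:
            token += ch
    return found
```

In `((A,B),(C,D),E);`, for example, A and B are sister taxa, and the clades (A,B) and (C,D) stem from the same node together with E.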

SeaView also has a very decent built-in tree viewer/editor

you can also create several subsets of an alignment (inclusion sets) by clicking Sites-Create set

and giving the set a name

parts of sequences above the X marks (highlighted) are included in the selection. You can select sites with a combination of right- and left-clicks: a left-click unselects single sites, a right-click removes the selection between two unselected regions, a single left-click selects a single site, and by holding the left button and moving the mouse you can re-select whole regions. I know, it sounds awkward... TRY TO PRACTICE IT!

you can then duplicate and rename sets to create different inclusion sets, and save just the selection rather than the whole alignment. This feature can be extremely useful in phylogenetics and sets SeaView apart from other alignment editors (we will get to it next time)

Coding sequences should be aligned in ‘translation’ mode: temporarily translated into amino acids, aligned as amino acids, and back-translated into nucleotides while keeping the alignment positions

in SeaView, click Props-View as proteins and align the sequences

then uncheck View as proteins

now the sequences are aligned according to the ORF
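
The back-translation step can be sketched in a few lines: every gap in the protein alignment becomes three gaps, and every residue is replaced by its original codon. The function name `backtranslate` is hypothetical, and the sketch assumes an in-frame coding sequence without a stop codon, so that it has exactly three nucleotides per aligned residue.

```python
def backtranslate(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its aligned protein translation.

    aligned_protein: protein sequence with '-' for alignment gaps.
    cds: the corresponding in-frame nucleotide sequence (no stop codon).
    """
    n_residues = len(aligned_protein) - aligned_protein.count("-")
    if 3 * n_residues != len(cds):
        raise ValueError("CDS length does not match the number of residues")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    out = []
    next_codon = 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")  # one amino-acid gap = three nucleotide gaps
        else:
            out.append(codons[next_codon])
            next_codon += 1
    return "".join(out)
```

Because gaps are always inserted in multiples of three, the reading frame of every sequence is preserved in the nucleotide alignment.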

Phylogenetic inference

You don’t have to use state-of-the-art phylogenetic methods for initial analyses, whose purpose is to (quickly) identify redundancy (duplicates and very similar sequences), aberrant or very divergent sequences, or the need to extend the dataset (quite often you realize you should have added some other taxa). For that, a simple neighbour-joining tree based on the JC, K2P or HKY model, or a stripped-down maximum-likelihood run (without gamma categories and branching support), will suffice and do the job quickly even on older computers.
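
As an example of what a distance model does, the Jukes-Cantor (JC) correction converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -3/4 ln(1 - 4p/3). The sketch below computes it for a pair of aligned nucleotide sequences; the function names are hypothetical, and columns with a gap in either sequence are simply skipped (an assumption, since programs differ in how they treat gaps).

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance in substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return float("inf")  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance is always at least as large as p, because some substitutions hit the same site more than once and are hidden in the raw comparison. A matrix of such distances is what neighbour joining takes as input.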

On the other hand, for the purpose of publication (or if you want to be sure), once you’ve polished your dataset, you should use the best possible methods. That usually means maximum likelihood with gamma-corrected GTR (nucleotides) or LG or WAG (amino acids) substitution matrices (or models of evolution, if you wish... these matrices tell the computer how probable a change from one state to another is). But it all depends on the dataset... if the sequences are similar and/or there are just a few of them, it may be preferable to use simpler matrices/models. There are also models dedicated to organellar genomes and/or specific taxonomic groups (like mtArt, which is tailored for the analysis of mitochondrial genes of arthropods). There are programs that tell you which model suits your dataset best, for example jModelTest for nucleotides and ProtTest for amino acids (also available as a server).

The credibility of the topology should be ‘tested’ using (non-parametric) bootstrap analysis, during which the software creates pseudoreplicates by resampling alignment columns at random with replacement (all taxa are included) and infers the topology from these pseudoreplicates instead of the original dataset. For publication, 100 replicates are a bare minimum, though the reviewer will probably require 300 or more. If the analysis is meant just for you (or your boss), 100 is quite enough (in my opinion); alternatively, you can use an even faster method called the ‘approximate likelihood-ratio test’ (aLRT, implemented in some software).
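
A single bootstrap pseudoreplicate is easy to generate by hand, which shows what the software does before it rebuilds the tree. This is a minimal sketch: the function name `bootstrap_replicate` is hypothetical and the alignment is assumed to be a dictionary mapping taxon names to equal-length aligned sequences.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement; all taxa are kept.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    rng: a random.Random instance, so the replicate can be reproduced.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw as many column indices as there are sites; the same column may be
    # drawn several times and others not at all.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns) for taxon, seq in alignment.items()}
```

The bootstrap support of a branch is then the percentage of trees inferred from such replicates in which that branch appears.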

Nowadays, most reviewers/editors will also require another type of phylogenetic analysis called Bayesian inference. Here you use the same (or similar) models, but the method of topology search is totally different; also, the branching support is expressed as a posterior probability (ranging from 0 to 1) instead of a bootstrap value. Be careful with the interpretation of these two values. In bootstrap, everything higher than 50 (meaning the branching appeared in at least 50% of the replicates) is considered supported (although weakly); the closer you get to 100, the more confident you can be in the branching. OTOH, a posterior probability below 0.95 (some go down to 0.90) should be considered unsupported! Only nodes with a PP value of 1.0 (or 0.99) are considered strongly supported.

Phylogenetic inference - software

A surprising amount of software is available (given the obscurity of the topic; an almost exhaustive list is to be found here), but most of it is too specialized, slow, obsolete or not worth using for other reasons. Unfortunately, most programs (like 99%) are command-line based, without any user-friendly graphical interface. But some of the good or passable ones are implemented in SW with a GUI (like SeaView or Geneious) or at least have a server version. So, here is a short list of recommended phylogenetic software:

Ambiguous regions detection/removal: several SW, but nothing exciting; try Aliscore or Gblocks (server)

Distance methods: PAUP (commercial), Phylip, BioNJ

Maximum parsimony: PAUP (commercial), Phylip

Maximum likelihood: RAxML (server), PhyML (server), FastTree (REALLY fast, great for preliminary analyses), GARLI

Bayesian inference: MrBayes, PhyloBayes

Tree viewer/editor: NJplot (improved version also implemented in SeaView), FigTree, TreeView

This list is far from exhaustive, but the software noted above should suit a general audience (like you) in terms of purpose and performance.

meticulous analysis of SGP is necessary!!!

you could also use an automated approach (Phylosorter), but the risk of error is quite significant and the parameters should be as strict as possible

‘clean’ datasets can be merged (concatenated) into a supermatrix

Scafos, phyutility, SeaView, MacClade, Bioedit?...
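
Concatenation itself is simple bookkeeping: per-gene alignments are joined end to end, and a taxon missing from a gene is padded with gaps. The sketch below is an illustration only, not how any of the listed tools is implemented; the function name `concatenate` and the use of '-' for missing data are assumptions.

```python
def concatenate(gene_alignments):
    """Merge per-gene alignments into a supermatrix.

    gene_alignments: list of dicts, each mapping taxon name -> aligned sequence
    (sequences within one gene must have equal length).
    Returns (supermatrix, partitions), where supermatrix maps taxon -> sequence
    and partitions lists the 1-based (start, end) columns of each gene.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for aln in gene_alignments:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            # A taxon without this gene gets a block of gaps (missing data).
            parts[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(blocks) for taxon, blocks in parts.items()}
    return supermatrix, partitions
```

Keeping the partition boundaries is worthwhile, because partitioned models and per-gene filtering steps need to know where each gene sits in the supermatrix.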

- both SW and HW demanding
- due to the amount of data, the most complex models are necessary; they are prone to errors and time-consuming

+ SHOULD produce robust results

Multi-Gene Phylogenies

phylogenetic artifacts

why?
- poor taxon sampling
- too weak/strong phylogenetic signal
- violation of the model assumptions (different base composition, mutation rates...)
- inappropriate model used

Long-Branch Attraction (LBA)

- the most (in)famous and common artifact
- high evolutionary rates cause artificial grouping of long-branching taxa

Artifacts elimination

adding more genes

- 2012: 258; 2009: 127; 2008: 135 (same author, different datasets)

adding more taxa

- poor taxon sampling is considered to be the most common reason
- ideally, all taxa should be included
- reasonably, all relevant and available taxa should be included
- realistically, we have to work with the few available

removal of problematic (fast-evolving) taxa

- analysis of the dataset with different combinations of taxa and comparison of the resulting topologies
- an efficient way to overcome LBA

improving methodology

- current HW and SW enable the application of state-of-the-art models
- LG4M, LG4X (RAxML)
- CAT(+GTR): each position of the alignment has specific equilibrium and model parameters
- covarion, non-homogeneous: each taxon has a specific rate of evolution
- HW and time demanding!

removal of fast-evolving genes

- a simple and fast way to reduce signal noise
- for each gene, we compute the overall ML distance and remove the most divergent genes
- TREEPUZZLE, RAxML

removal of fast-evolving sites

- usually more efficient
- each site of the alignment is assigned to a specific rate category (usually 8/16)
- the highest category(ies) are removed
- dependent on topology/model
- TREEPUZZLE, AIRremover
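
Once each site has a rate category, stripping the fastest ones is a simple filter. The sketch below assumes the categories have already been estimated by other software and are given as one integer per alignment column (1 = slowest); the function name and its parameters are hypothetical.

```python
def remove_fast_sites(alignment, site_categories, n_categories=8, n_remove=1):
    """Drop alignment columns assigned to the highest rate categories.

    alignment: dict mapping taxon name -> aligned sequence.
    site_categories: one rate category per column, 1 (slowest) to n_categories.
    n_remove: how many of the fastest categories to remove.
    """
    cutoff = n_categories - n_remove
    keep = [i for i, category in enumerate(site_categories) if category <= cutoff]
    return {taxon: "".join(seq[i] for i in keep) for taxon, seq in alignment.items()}
```

Comparing the topologies obtained after removing one, two or more categories is one way to see whether a suspicious grouping depends on the fastest sites.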

recoding of aa

- for datasets with a large proportion of saturated sites
- amino acids are recoded according to their biochemical properties into four categories (Dayhoff matrix)
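
Recoding is a plain character substitution. The slides do not give the grouping, so the sketch below assumes one four-state Dayhoff-style grouping in common use (AGPST, DENQ, HKR, FYWILMV, with cysteine treated as missing data); the state labels and the function name are arbitrary choices for illustration.

```python
# Assumed four-state grouping; check which grouping your software actually uses.
DAYHOFF4_GROUPS = {"AGPST": "A", "DENQ": "C", "HKR": "G", "FYWILMV": "T"}
RECODE = {aa: state for group, state in DAYHOFF4_GROUPS.items() for aa in group}


def recode_dayhoff4(protein):
    """Recode an aligned protein sequence into four states.

    Gaps are kept; cysteine and any other unlisted symbol become '?'.
    """
    return "".join(
        "-" if residue == "-" else RECODE.get(residue.upper(), "?")
        for residue in protein
    )
```

Substitutions within a group become invisible after recoding, which is how the method reduces saturation, at the cost of discarding part of the signal.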

selection of genes with congruent signal

- clever, but is it kosher? ... doesn’t work that well anyway
- Concaterpillar

Phylogenomics is (not) surprisingly hard to publish; usually you have to combine at least a few of the approaches above to satisfy the reviewers!

- adding genes to MGP
- adding more taxa
- removal of problematic (fast-evolving) taxa
- improving methodology
- removal of fast-evolving genes
- removal of fast-evolving sites
- recoding of aa
- selection of genes with congruent signal

So... is it worth it, when quite often you get the same topology as with SSU rRNA?
