
Upload: willis-thomas

Post on 17-Jan-2016


Page 1: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Phylogenetic analysis

taken from http://allserv.rug.ac.be/~avierstr

and

http://www.cs.otago.ac.nz/cosc348/Lectures/MSAPhylogeny.htm

And

Introduction to Bioinformatics course slides

Page 2: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Purpose of phylogenetics:

• Reconstruct the evolutionary relationships between species. Experience shows that closely related organisms have similar sequences, while more distantly related organisms have more dissimilar sequences.

• Estimate the time of divergence between two organisms since they last shared a common ancestor.

But…

• The theory and practical applications of the different models are not universally accepted.

• It is important to have a good alignment to start with (garbage in, garbage out).

• Trees based on an alignment of a gene represent the relationship between genes, and this is not necessarily the same as the relationship between the whole organisms. Trees calculated from different genes of the same organisms may therefore show different relationships.

Page 3: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Why is phylogeny important?

• Determining tree of life (e.g., for a new organism)

• Determining gene function

• Understand which parts of the gene/regulatory sequences are important

• Tracing the evolution of genes – horizontal gene transfer etc.

Page 4: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Protein or DNA?

• As with multiple sequence alignment, proteins are preferred

– More informative

– Shorter in length

– Less chance of multiple mutations at the same site

• When DNA?

– A non-coding sequence

– Proteins too similar

Page 5: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Terminology:

• node: a node represents a taxonomic unit. This can be a taxon (an existing species) or an ancestor (an unknown species that represents the ancestor of 2 or more species).

• branch : defines the relationship between the taxa in terms of descent and ancestry.

• topology : is the branching pattern.

• branch length : often represents the number of changes that have occurred in that branch.

• root : is the common ancestor of all taxa.

• distance scale : scale which represents the number of differences between sequences (e.g. 0.1 means 10 % diff

Page 6: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Possible ways of drawing a tree :

Unscaled branches : the length is not proportional to the number of changes.

Page 7: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Possible ways of drawing a tree :

•Scaled branches : the length of the branch is proportional to the number of changes (usually in PAMs). The distance between 2 species is the sum of the length of all branches connecting them.
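The rule that the distance between two species is the sum of the branch lengths connecting them can be sketched in a few lines. This is only an illustration: the tree, the taxon names and the branch lengths below are made up, and the child-to-parent dictionary is just one convenient way to store a scaled tree.

```python
# Toy scaled tree stored as child -> (parent, branch_length).
# Names and lengths (numbers of changes) are hypothetical.
PARENT = {
    "human": ("anc1", 1),
    "chimp": ("anc1", 2),
    "anc1": ("root", 3),
    "gorilla": ("root", 5),
}


def path_to_root(node):
    """Return [(ancestor, distance_from_node)], starting with the node itself."""
    path, dist = [(node, 0)], 0
    while node in PARENT:
        node, length = PARENT[node]
        dist += length
        path.append((node, dist))
    return path


def tree_distance(a, b):
    """Sum of the branch lengths on the path connecting nodes a and b."""
    up_a = dict(path_to_root(a))
    for node, dist_b in path_to_root(b):
        if node in up_a:  # first ancestor shared by a and b
            return up_a[node] + dist_b
    raise ValueError("nodes are not in the same tree")


print(tree_distance("human", "chimp"))    # branches 1 + 2
print(tree_distance("human", "gorilla"))  # branches 1 + 3 + 5
```

The lookup stops at the first shared ancestor, so branches above that ancestor are not counted.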

Page 8: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Possible ways of drawing a tree :

• Rooted tree: the root is the common ancestor. The direction of each path from the root corresponds to evolutionary time.

• Unrooted tree: specifies the relationships among species and does not define the evolutionary path.

Page 9: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

9

Rooted vs. unrooted trees

[Figure: a rooted tree and an unrooted tree on the taxa 1, 2 and 3]

Page 10: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

10

The position of the root does not affect the maximum parsimony (MP) score.

Rooted vs. Unrooted.

Page 11: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

11

Intuition why rooting doesn’t change the score

[Figure: tree with leaves s1, s4, s3, s2, s5 whose states for gene number 1 are 1, 1, 1, 0, 0; internal nodes are labelled 1, 0 and "1 or 0"]

The change will always be on the same branch, no matter where the root is positioned…
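A small check of this intuition, using Fitch's algorithm to count the minimum number of changes for one character. The leaf states (1, 1, 1, 0, 0 for s1, s4, s3, s2, s5) come from the slide; the tree shape is an assumption, since the figure itself is not recoverable, and the two nested tuples are two rootings of that same assumed unrooted tree.

```python
def fitch(tree, states):
    """Fitch's algorithm for one character on a rooted binary tree.

    tree is a leaf name or a 2-tuple of subtrees.
    Returns (possible_states_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


# Presence (1) or absence (0) of gene number 1 in each species.
states = {"s1": 1, "s4": 1, "s3": 1, "s2": 0, "s5": 0}

# Assumed unrooted tree ((s1,s4),s3,(s2,s5)), rooted on two different branches.
rooted_on_inner_branch = ((("s1", "s4"), "s3"), ("s2", "s5"))
rooted_on_s1_branch = ("s1", ("s4", ("s3", ("s2", "s5"))))

score_a = fitch(rooted_on_inner_branch, states)[1]
score_b = fitch(rooted_on_s1_branch, states)[1]
print(score_a, score_b)
```

Both rootings need a single change, on the branch separating (s2, s5) from the rest.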

Page 12: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

12

We want rooted trees!

How can we root the tree?

Page 13: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

13

Page 14: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

14

Page 15: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

15

Gorilla gorilla (gorilla)

Homo sapiens (human)

Pan troglodytes (Chimpanzee)

Gallus gallus (chicken)

Page 16: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

16

Evaluate all 3 possible UNROOTED trees:

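• Human + Chimp vs. Chicken + Gorilla

• Human + Gorilla vs. Chimp + Chicken

• Human + Chicken vs. Chimp + Gorilla

[Figure: the three unrooted four-taxon trees; one of them is marked "MP tree"]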
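Why exactly three trees? With four taxa, an unrooted binary tree is fully described by which taxon is paired with the first one. The sketch below enumerates those pairings and also counts unrooted binary trees for larger taxon sets with the standard (2n − 5)!! formula; the function names are illustrative only.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted binary trees on n labelled taxa: (2n - 5)!! for n >= 3."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= k
    return count


def quartet_topologies(taxa):
    """The three ways to split four taxa into two pairs, one per unrooted tree."""
    first, *rest = taxa
    result = []
    for partner in rest:
        others = tuple(t for t in rest if t != partner)
        result.append(((first, partner), others))
    return result


for pair, others in quartet_topologies(["Human", "Chimp", "Chicken", "Gorilla"]):
    print(pair, "vs.", others)
print(count_unrooted_trees(4), count_unrooted_trees(10))
```

The count grows very quickly with the number of taxa, which is why exhaustive evaluation is only feasible for small data sets.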

Page 17: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

17

Rooting based on a priori knowledge:

[Figure: the unrooted tree of Human, Chimp, Chicken and Gorilla, next to the same tree drawn rooted]

Page 18: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

18

Ingroup / Outgroup:

[Figure: the rooted tree of Human, Chimp, Chicken and Gorilla, with the labels INGROUP and OUTGROUP]

Page 19: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Tree of life

Page 20: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Distance-based methods

• Compress all of the individual differences between pairs of sequences into a single number – the distance.

• Starting from an alignment, pairwise distances are calculated between DNA sequences as the sum of all base pair differences between two sequences (the most similar sequences are assumed to be closely related). This creates a distance matrix.

• From the obtained distance matrix, a phylogenetic tree is calculated with clustering algorithms. These cluster methods construct a tree by linking the least distant pair of taxa, followed by successively linking more distant taxa.

• Algorithms: UPGMA clustering, Neighbor Joining.

• UPGMA assumes a molecular clock.

• Tool: ClustalW
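The steps above (count differences, build a distance matrix, repeatedly link the least distant pair) can be sketched as a bare-bones UPGMA. This is a minimal illustration, not the ClustalW implementation: the four toy sequences are invented, ties are broken by input order, and branch lengths are not computed.

```python
from itertools import combinations


def count_differences(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs):
    """Join the closest pair of clusters until one remains; returns nested tuples."""
    clusters = {name: 1 for name in seqs}  # cluster -> number of leaves in it
    dist = {frozenset(p): count_differences(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # size-weighted average of the old distances (the UPGMA update rule)
            dist[frozenset((merged, c))] = (
                size_a * dist[frozenset((a, c))] + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical aligned DNA sequences.
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAACTA"}
print(upgma(seqs))
```

Here A and B differ at one position, as do C and D, so the two pairs are joined first and then linked to each other.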

Page 21: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Cladistic methods

• Trees are calculated by considering the various possible pathways of evolution and are based on parsimony or likelihood methods. These methods use each alignment position as evolutionary information to build a tree.

• Parsimony: looks for the most parsimonious tree, i.e. the tree with the fewest evolutionary changes needed for all sequences to derive from a common ancestor.

– Slower than distance methods.

– Assumes molecular clock

• Maximum likelihood: looks for the tree with the maximum likelihood, i.e. the most probable tree.

– This is the slowest method of all but seems to give the best result and the most information about the tree.

– No molecular clock assumption

• Tool: Phylip

Page 22: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Two homologous DNA sequences which descended from an ancestral sequence and accumulated mutations since their divergence from each other. Note that although 12 mutations have accumulated, differences can be detected at only three nucleotide sites.

Even the best evolutionary models can't solve this problem...
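Models cannot recover the individual hidden changes, but they can estimate statistically how many substitutions lie behind an observed proportion of differing sites. As one common example (not mentioned on the slide), the Jukes-Cantor correction for DNA is sketched below; it assumes equal base frequencies and equal substitution rates.

```python
import math


def p_distance(a, b):
    """Observed proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Estimated substitutions per site: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("sequences are saturated; the distance cannot be estimated")
    return -0.75 * math.log(1 - 4 * p / 3)


# The corrected distance is always at least as large as the observed one.
for p in (0.05, 0.3, 0.6):
    print(p, round(jukes_cantor(p), 3))
```

The correction grows as sequences diverge, and it breaks down entirely at saturation (p close to 0.75), which is the situation the slide warns about.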

Page 23: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Molecular clocks

• Assumption: constant rate of evolution

• Different rate for different genes:

[Figure: plot with axis “Millions of years since divergence” (Dickerson, 1971)]

Page 24: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Human insulin

Page 25: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Insulin multiple alignment

Page 26: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Problems with molecular clocks

Surprisingly, insulin from the guinea pig evolved seven times faster than insulin from other species. Why?

The answer is that guinea pig insulin does not bind two zinc ions, while insulin molecules from most other species do. There was a relaxation on the structural constraints of these molecules, and so the genes diverged rapidly.

Page 27: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Building trees with ClustalW

Place alignment here

Choose a tree here

http://www.ebi.ac.uk/clustalw/

Page 28: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

PHYLIP

• A suite of phylogeny tools

• Both web servers and stand-alone applications

• Used for distance/parsimony/maximum likelihood

• http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html

Page 29: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Sequences

Page 30: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Bootstrapping

• Assigns confidence to individual tree branches

• Columns of the alignment are randomly sampled (with replacement) and the tree is recomputed, for X iterations.

• Bootstrap value of a branch = the number of iterations in which the recomputed tree contains that branch.
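The resampling step can be sketched as follows. To keep the example short, a real tree builder is replaced by a stand-in that only reports the closest pair of sequences, so the "branch" being scored is a two-taxon clade; the alignment, function names and default of 100 iterations are all hypothetical.

```python
import random


def resample_columns(alignment, rng):
    """Sample alignment columns with replacement; alignment maps name -> sequence."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for a tree builder: the pair of sequences with the fewest differences."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: sum(
        x != y for x, y in zip(alignment[p[0]], alignment[p[1]])))


def bootstrap_support(alignment, clade, iterations=100, seed=0):
    """Fraction of bootstrap replicates in which `clade` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(closest_pair(resample_columns(alignment, rng)) == clade
               for _ in range(iterations))
    return hits / iterations


toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "TGCATGCAAA"}
print(bootstrap_support(toy, ("A", "B")))
```

In practice the stand-in would be replaced by the same tree-building method used for the original alignment, and every branch of the original tree would be scored.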

Page 31: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

Collections of homologous genes

• Homologene @ Entrez

– http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene

• COG – Clusters of Orthologous Genes

– Results of BLAST all-vs-all between genomes. Genes within the same COG are “pairwise best hits”

– http://www.ncbi.nlm.nih.gov/COG/

• RDP – Ribosomal sequences

– The “standard” sequences for doing species phylogeny

– Focused on Bacteria

– http://rdp8.cme.msu.edu/html/

Page 32: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

32

Orthologs

Homologous sequences are orthologous if they were separated by a speciation event:

If a gene exists in a species, and that species diverges into two species, then the copies of this gene in the resulting species are orthologous.

Page 33: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

33

Orthologs

• Orthologs will typically retain the same or a similar function in the course of evolution.

• Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.

Page 34: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

34

Orthologs

[Figure: a speciation event splits an ancestor into descendant 1 (e.g., human) and descendant 2 (e.g., dog)]

Page 35: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

35

Paralogs

Homologous sequences are paralogous if they were separated by a gene duplication event:

If a gene in an organism is duplicated, then the two copies are paralogous.

Page 36: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

36

Paralogs

• Orthologs will typically have the same or similar function.

• This is not always true for paralogs: because the original selective pressure no longer acts on one copy of the duplicated gene, this copy is free to mutate and acquire new functions.

Page 37: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

37

Paralogs

Duplication

Page 38: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

38

(taken from NCBI)

Page 39: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

39

Using BLAST and phylogeny to study gene evolution

Page 40: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

40

Mol. Biol. Evol. (2005) 22:598-606

Page 41: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

41

Evolutionary rate and conservation

Functionally or structurally important sites are conserved:

Conserved sites: “slow evolving” sites

Variable sites: “fast evolving” sites

Sites which are under a functional/structural constraint are conserved, and evolve slowly

Page 42: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

42

Conservation in an MSA

S1 KITAYCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN

S2 MPFERCELARTLKRMADADIRGVSLANWVCLAKWFWDGG

S3 MPFERCELARTLKRMMDADIRGVSLANWVCLAKWFWDGG

From the MSA (and the tree), one can determine how conserved a gene is.
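One simple way to put a number on conservation is the fraction of sequences that share the most common residue in each column. The sketch below applies it to the three sequences above; this particular score is an assumption chosen for illustration, not the measure used in the study discussed later.

```python
from collections import Counter

MSA = [
    "KITAYCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN",  # S1
    "MPFERCELARTLKRMADADIRGVSLANWVCLAKWFWDGG",  # S2
    "MPFERCELARTLKRMMDADIRGVSLANWVCLAKWFWDGG",  # S3
]


def column_conservation(msa):
    """Per column: fraction of sequences sharing the most common residue."""
    return [Counter(col).most_common(1)[0][1] / len(msa) for col in zip(*msa)]


scores = column_conservation(MSA)
fully_conserved = sum(score == 1.0 for score in scores)
print(f"{fully_conserved} of {len(scores)} columns are identical in all sequences")
```

The fully conserved columns fall in two blocks (CELART and GVSLANWVCLAKW), which is the kind of pattern that points to functional or structural constraint.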

Page 43: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

43

“Inverse relation between evolutionary rate and age of mammalian genes”: Protocol

Page 44: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

44

Step 1 - BLAST

Build the dataset of mammalian genes

Page 45: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

45

Step 1 – BLAST: build the dataset of mammalian genes, based on mouse-human ortholog pairs

• The orthologs are defined as pairs of reciprocal BLAST hits.

• Eliminate genes with more than one potential orthologous sequence.

• Select only genes for which the human protein was functionally annotated.
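The first two bullets can be sketched as a reciprocal-best-hit filter over BLAST scores. The gene names and bit scores below are invented, and treating a tied top score as "more than one potential ortholog" is an assumption; the slide does not say how the study made that call.

```python
def best_hits(scores):
    """scores maps (query, subject) -> bit score.

    Returns each query's single best subject; queries whose top score is tied
    are skipped (assumed here to mean more than one potential ortholog).
    """
    by_query = {}
    for (query, subject), score in scores.items():
        by_query.setdefault(query, []).append((score, subject))
    best = {}
    for query, hits in by_query.items():
        hits.sort(reverse=True)
        if len(hits) == 1 or hits[0][0] > hits[1][0]:
            best[query] = hits[0][1]
    return best


def reciprocal_best_hits(human_vs_mouse, mouse_vs_human):
    """Pairs (human, mouse) that are each other's best hit in both directions."""
    forward = best_hits(human_vs_mouse)
    backward = best_hits(mouse_vs_human)
    return sorted((h, m) for h, m in forward.items() if backward.get(m) == h)


# Hypothetical BLAST bit scores.
human_vs_mouse = {("hA", "mA"): 200, ("hA", "mB"): 50, ("hB", "mB"): 120,
                  ("hC", "mA"): 90, ("hD", "mD"): 80, ("hD", "mE"): 80}
mouse_vs_human = {("mA", "hA"): 210, ("mB", "hB"): 115, ("mB", "hA"): 40,
                  ("mD", "hD"): 80, ("mE", "hD"): 80}
print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```

In this toy input hC is dropped because its best mouse hit prefers hA, and hD is dropped because two mouse genes match it equally well.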

Page 46: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

46

Step 2 – Calculate conservation

Page 47: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

47

Step 2 – Calculate Evolutionary Rates (Conservation)

For each orthologous pair:

• Alignment at the amino acid level.

• Measure evolutionary rate

The dataset contained 6,776 human-mouse gene pairs.

Page 48: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

48

Step 3 – Assignment of Temporal Categories

How old is each gene? Used BLAST to find homologs in 6 different eukaryotic genomes

Page 49: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

49

Caenorhabditis elegans

Schizosaccharomyces pombe

Takifugu rubripes

Drosophila melanogaster

Arabidopsis thaliana

Saccharomyces cerevisiae

Page 50: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

50

What is Old?

• Presence of any homolog in all the 6 genomes.

What is Presence?

• Using an E-value cutoff of 10⁻⁴ in BLAST.

OLD

METAZOANS

DEUTEROSTOMES

TETRAPODS

Page 51: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

51

• METAZOANS - Organisms whose bodies consist of many cells, as distinct from Protozoa, which are unicellular; also commonly called animals.

• DEUTEROSTOMES - The second of the two main groups of bilaterally symmetrical animals. The name derives from 'deutero' (second) 'stome' (mouth), referring to the origin of the definitive mouth as an opening independent from the blastopore of the embryo.

• TETRAPODS - Any four-legged animals, including mammals, birds, reptiles and amphibians.

Page 52: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

52

[Figure: species tree with leaves Human, Mouse, Fish, Insect, Worm, Yeast and Plant; its internal nodes define the gene age classes Tetrapods, Deuterostomes, Metazoa and Old (eukaryotes)]
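A sketch of how a gene could be assigned to one of these classes from its best BLAST E-value in each of the six genomes. Only the "old" rule (a hit in all six genomes) and the 10⁻⁴ cutoff come from the slides; the rules for the three younger classes are a reading of the tree figure, and the handling of patchy hit patterns is an assumption.

```python
E_VALUE_CUTOFF = 1e-4  # cutoff quoted on the slide

# Grouping of the six genomes by position in the species tree (assumed).
FISH = {"T. rubripes"}
INVERTEBRATES = {"D. melanogaster", "C. elegans"}
FUNGI_AND_PLANT = {"S. cerevisiae", "S. pombe", "A. thaliana"}
ALL_SIX = FISH | INVERTEBRATES | FUNGI_AND_PLANT


def temporal_category(best_e_values, cutoff=E_VALUE_CUTOFF):
    """best_e_values maps genome -> best E-value; a missing genome means no hit."""
    present = {g for g, e in best_e_values.items() if e < cutoff}
    if present == ALL_SIX:
        return "old"
    if present == FISH | INVERTEBRATES:
        return "metazoans"
    if present == FISH:
        return "deuterostomes"
    if not present:
        return "tetrapods"
    return None  # patchy pattern; its treatment is not stated in the slides


print(temporal_category({"T. rubripes": 1e-30}))
print(temporal_category({"T. rubripes": 1e-3}))  # above the cutoff: counts as absent
```

The second call shows the point the later slides turn on: a homolog that exists but scores above the cutoff is indistinguishable from a homolog that is missing.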

Page 53: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

53

Results

Page 54: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

54

Negative correlation between “age” of genes and the rate of evolution

[Figure: four panels, each with an axis labelled “Evolutionary rate”]

Page 55: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

55

Control.

• Changing the BLAST E-value cutoff to a more conservative value of 10⁻¹⁰ did not significantly affect the result.

Page 56: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

56

Explanations

Page 57: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

57

• Functional constraints remained constant throughout the evolutionary history of each gene, but the newer genes are less constrained than older genes.

• Functional constraints are not constant, rather they are weak at the time of origin of a gene and they become progressively more stringent with age.

Page 58: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

58

Eran Elhaik, Niv Sabath, and Dan Graur

Mol. Biol. Evol. 23(1):1–3. 2006

Page 59: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

59

Goal

• To show that these results are an artifact caused by our inability to detect similarity when genetic distances are large.

Page 60: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

60

Simulation

Page 61: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

61

The evolutionary process

[Figure: tree with leaves Rat, Dog, Cat, Mouse and Fly, next to a table of replacement probabilities between amino acids (Ala, Arg, Val, …)]

Page 62: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

62

The evolutionary process

[Figure: the same tree and table of replacement probabilities; the root is assigned the amino acid V]

Page 63: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

63

The evolutionary process

[Figure: the same tree and table of replacement probabilities; two nodes are now labelled V]

Page 64: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

64

The evolutionary process

[Figure: the same tree and table of replacement probabilities; internal nodes are now labelled V, V and L]

Page 65: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

65

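The evolutionary process

[Figure: the same tree and table of replacement probabilities; internal nodes are labelled V, V, L and L, and the five leaves (Rat, Dog, Cat, Mouse, Fly) carry the amino acids L, L, I, M and V]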

Page 66: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

66

The evolutionary process

Rat   L M T G S H M G N F I I

Mouse L M T G S G M A N H V I

Cat   I M T G S H I G Y A M F

Dog   M M T G S G I G L T R A

Fly   V M T G S W R G R M Y A

...

And repeat the process for all positions… (assume: each position evolves independently)
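The process shown on these slides can be sketched as a recursive walk down the tree: each branch copies the parent sequence and replaces every position independently with some probability. This is a simplification under stated assumptions: the tree shape is hypothetical, every branch uses the same replacement probability, and the replacement residue is drawn uniformly instead of from a table of replacement probabilities.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve(seq, p_replace, rng):
    """Copy seq along one branch, replacing each position with probability p_replace."""
    return "".join(
        rng.choice(AMINO_ACIDS.replace(aa, "")) if rng.random() < p_replace else aa
        for aa in seq
    )


def simulate(tree, seq, p_replace, rng):
    """tree is a leaf name or a tuple of subtrees; returns {leaf name: sequence}."""
    if isinstance(tree, str):
        return {tree: seq}
    leaves = {}
    for child in tree:
        child_seq = evolve(seq, p_replace, rng)  # one branch from parent to child
        leaves.update(simulate(child, child_seq, p_replace, rng))
    return leaves


TREE = ((("Rat", "Mouse"), ("Cat", "Dog")), "Fly")  # assumed topology
rng = random.Random(42)
root = "".join(rng.choice(AMINO_ACIDS) for _ in range(12))
for name, seq in simulate(TREE, root, 0.2, rng).items():
    print(f"{name:5s} {seq}")
```

Raising `p_replace` plays the role of a higher evolutionary rate: more replacements accumulate on every branch.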

Page 67: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

67

The aim of the simulations: generate sequences with the following phylogenetic relationships:

[Figure: tree with leaves A, B, C, D and E]

• All the genes originated in the common ancestor of A, B, C, D, E and are, thus, of equal age.

• A and B: similar to the human and mouse orthologous genes.

• C, D and E: remote homologs from increasingly distant taxa (similar to fish, insect, yeast…)

Page 68: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

68

Simulation

• They simulated genes with 101 different rates.

• A high rate means a higher likelihood of an amino acid replacement in each branch.

Page 69: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

69

After simulating the sequences:

Use BLAST, in the same way that Alba and Castresana used it, to detect homology between gene A and genes C, D and E.

Page 70: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

70

Only one difference – the group names

OLD: SENIORS

METAZOANS: ADULTS

DEUTEROSTOMES: TEENAGERS

TETRAPODS: TODDLERS

Page 71: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

71

Results

Page 72: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

72

Same as Alba and Castresana

Page 73: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

73

But all the simulated genes are of the same “age”.

What is the problem ???

Page 74: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

74

We can only count genes that are identified as homologous by the protocol … BLAST

Page 75: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

75

Alba and Castresana may have, thus, failed to spot the vast majority of homologs from among the fastest evolving genes

Page 76: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

76

The vast majority of the fastest evolving genes are undetectable even when the cutoffs are extremely permissive.

Page 77: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

77

Conclusion

Page 78: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

78

The inverse relationship between evolutionary rate and gene age is an artifact caused by our inability to detect similarity when genetic distances are large.

Page 79: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

79

• Since genetic distance increases with time of divergence and rate of evolution, it is difficult to identify homologs of fast evolving genes in distantly related taxa.

• Thus, fast evolving genes may be misclassified as “new”.
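A back-of-the-envelope version of this argument, under loudly simplified assumptions: replacements hit each site as a Poisson process, so the expected fraction of untouched sites after time t at rate r is exp(−r·t), and a homolog counts as "detected" when that fraction stays above a fixed threshold. The divergence times and the threshold are hypothetical stand-ins for the simulated tree and the BLAST cutoff.

```python
import math

# Hypothetical divergence times (arbitrary units) between gene A and each remote taxon.
DIVERGENCE = {"C": 1.0, "D": 2.0, "E": 4.0}
IDENTITY_THRESHOLD = 0.3  # assumed stand-in for a BLAST E-value cutoff


def expected_identity(rate, time):
    """Expected fraction of sites with no replacement at `rate` per unit time."""
    return math.exp(-rate * time)


def apparent_age(rate):
    """Most distant taxon in which the homolog is still detected.

    Every gene here is equally old; only its evolutionary rate differs.
    """
    detected = [taxon for taxon, t in DIVERGENCE.items()
                if expected_identity(rate, t) >= IDENTITY_THRESHOLD]
    return max(detected, key=DIVERGENCE.get) if detected else "none"


for rate in (0.1, 0.5, 1.0, 2.0):
    print(rate, apparent_age(rate))
```

The slowest gene is found in the most distant taxon and would be called old, while the fastest is found nowhere and would be called new, even though all four genes have the same true age.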

Page 80: Phylogenetic analysis taken from avierstr avierstr and  es/MSAPhylogeny.htm

80

So, the only conclusion that can be drawn from Alba and Castresana’s study is that

Slowly evolving genes evolve slowly!!!