Page 1: Practical on phylogenetic trees based on  sequence  alignments

Practical on phylogenetic trees based on sequence alignments

Kyrylo Bessonov
November 26th, 2013

Page 2: Practical on phylogenetic trees based on  sequence  alignments

Talk plan

• How to build phylogenetic trees of two types
  – Unrooted
  – Rooted

• Context – comparison of viral proteins of dengue virus

• Examples on phylogenetic tree building
  – Dengue virus

Page 3: Practical on phylogenetic trees based on  sequence  alignments

Building a phylo tree using ape

• ape – Analyses of Phylogenetics and Evolution
  – Functions to create and manipulate phylo trees
  – Graphical exploration of phylogenetic data

• To build a phylogenetic tree
  – Download protein sequences from DB
  – Align sequences
  – Calculate pairwise distances between the aligned sequences (dist.alignment() from seqinr)
  – Build and visualize a phylogenetic tree using ape

Page 4: Practical on phylogenetic trees based on  sequence  alignments

Building an unrooted phylogenetic tree (1)

#install required libraries
install.packages("seqinr")
install.packages("muscle")
install.packages("ape")
library("seqinr")
library("muscle")
library("ape")

#align the sequences with muscle and return a seqinr-style alignment object
multipleSeqAlignment <- function(seqnames, seqs) {
  #umax is an object of class fasta from muscle package; it is used as a template
  fasta_seqs_Object <- umax

  #one row per sequence: V1 = name, V2 = sequence collapsed to a single string
  tmp <- data.frame(V1 = rep(0, length(seqs)), V2 = rep(0, length(seqs)))
  for (i in 1:length(seqs)) {
    tmp[i, 1] <- seqnames[i]
    tmp[i, 2] <- paste(seqs[[i]], collapse = "")
  }
  fasta_seqs_Object$seqs <- tmp

  #multiple sequence alignment
  #remove conflicting ape library from the memory
  try(detach("package:ape"), silent = TRUE)
  alignment <- muscle(seqs = fasta_seqs_Object, out = NULL)
  alignment_ape <- ape::as.alignment(matrix(alignment$seqs[, 2]))
  alignment_ape$nam <- alignment$seqs[, 1]

  return(alignment_ape)
}

Page 5: Practical on phylogenetic trees based on  sequence  alignments

Building an unrooted phylogenetic tree (2)

#main part of the code
choosebank("swissprot") #selects database for query

seqnames <- c("P06747", "P0C569", "O56773", "Q5VKP1")

seqs <- list()
for (i in 1:length(seqnames)) {
  query <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(query)
}
#multipleSeqAlignment() is defined on previous slide
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)
#re-attach ape, which multipleSeqAlignment() detached
library("ape")
#nj() performs the neighbor-joining tree estimation by Saitou and Nei
mytree <- nj(mydist)
#labels must follow the order of the sequences in mydist
mytree$tip.label <- c("Q5VKP1-\nWestern Caucasian bat virus\nphosphoprotein",
                      "P06747-\nrabies virus\nphosphoprotein",
                      "P0C569-\nMokola virus\nphosphoprotein",
                      "O56773-\nLagos bat virus\nphosphoprotein")

plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3, cex = 0.8, no.margin = TRUE, srt = 50)

Page 6: Practical on phylogenetic trees based on  sequence  alignments

Unrooted Phylogenetic Tree

• Phylogenetic tree showing the distances between 4 viral protein sequences

• The genetic distance between O56773 and P0C569 is the smallest

Page 7: Practical on phylogenetic trees based on  sequence  alignments

Unrooted phylogenetic tree (1)

• The lengths of the branches in the plot of the tree are proportional to the amount of evolutionary change (estimated by number of mutations) along the tree branches

• This is an unrooted phylogenetic tree as it does not contain an outgroup sequence, that is, a sequence of a protein that is known to be more distantly related to the other proteins in the tree than they are to each other.

Page 8: Practical on phylogenetic trees based on  sequence  alignments

Unrooted phylogenetic tree (2)

• As a result, we cannot tell which direction evolutionary time ran in along the internal branches of the tree. For example, we cannot tell whether the node representing the common ancestor of (O56773, P0C569) was an ancestor of the node representing the common ancestor of (Q5VKP1, P06747), or the other way around.

Page 9: Practical on phylogenetic trees based on  sequence  alignments

Distance matrix

• Inspecting the calculated distance matrix between the aligned sequences confirms the results seen in the phylogenetic tree

• The closest pair is the O56773 and P0C569 proteins

         Q5VKP1  P06747  P0C569
P06747   0.49
P0C569   0.48    0.45
O56773   0.50    0.46    0.41
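The link between this matrix and the tree can be checked by hand. The sketch below uses base R only: it rebuilds the matrix from the values on this slide, finds the closest pair, and computes the pair-selection criterion of neighbour-joining, Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r(i) is the sum of distances from sequence i to all others. The variable names are illustrative, and this is only the first step of the algorithm, not a replacement for nj().

```r
# Pairwise distances copied from the slide's distance matrix
ids <- c("Q5VKP1", "P06747", "P0C569", "O56773")
d <- matrix(0, 4, 4, dimnames = list(ids, ids))
# lower.tri() fills the lower triangle column by column
d[lower.tri(d)] <- c(0.49, 0.48, 0.50,  # P06747, P0C569, O56773 vs Q5VKP1
                     0.45, 0.46,        # P0C569, O56773 vs P06747
                     0.41)              # O56773 vs P0C569
d <- d + t(d)                           # make the matrix symmetric

# Closest pair: smallest off-diagonal distance
off <- d
diag(off) <- NA
closest <- which(off == min(off, na.rm = TRUE), arr.ind = TRUE)[1, ]
closest_pair <- sort(ids[closest])

# First neighbour-joining step: the pair with the smallest Q is joined
n <- nrow(d)
r <- rowSums(d)
Q <- (n - 2) * d - outer(r, r, "+")
diag(Q) <- NA
print(closest_pair)
print(round(Q, 2))
```

With only four sequences the two pairs that make up the tree always share the same Q value, so the printed Q matrix is a quick way to see which two groupings the unrooted tree contains.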

Page 10: Practical on phylogenetic trees based on  sequence  alignments

Rooted phylogenetic tree

• In order to convert the unrooted tree into a rooted tree, we need to add an outgroup sequence
  – Outgroup
    • a taxon outside the group of interest
    • will branch off at the base of phylogeny
    • e.g. Caenorhabditis elegans (UniProt accession Q10572) and Caenorhabditis remanei (UniProt E3M2K8)

• If we were to build a phylogenetic tree of the Fox-1 homologues in vertebrates, the distantly related sequence from worms would probably be a good choice of outgroup, since the protein is from a different taxon/group (worms)
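A minimal sketch of what rooting on an outgroup does, assuming the ape package is installed. The four-tip Newick tree and its branch lengths are hypothetical and chosen only for illustration; read.tree(), is.rooted() and root() are ape functions.

```r
if (requireNamespace("ape", quietly = TRUE)) {
  # Hypothetical unrooted tree: three branches meet at the base
  unrooted <- ape::read.tree(text = "(human:0.1,monkey:0.1,(rat:0.3,worm:0.9):0.1);")
  was_rooted <- ape::is.rooted(unrooted)

  # Place the root on the branch leading to the outgroup tip
  rooted <- ape::root(unrooted, outgroup = "worm", resolve.root = TRUE)
  now_rooted <- ape::is.rooted(rooted)

  print(c(before = was_rooted, after = now_rooted))
}
```

The topology among the ingroup tips is unchanged by rooting; only the position of the root, and therefore the direction of time along the branches, is added.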

Page 11: Practical on phylogenetic trees based on  sequence  alignments

Building a rooted phylogenetic tree (1)

#BUILDING ROOTED TREE OF PROTEIN SEQUENCES (FOX1)
#Q9NWB1 - Human
#Q17QD3 - Cow
#Q95KI0 - Monkey
#A1A5R1 - Rat
#Q10572 - Worm C.elegans(Root)
#E1G4K8 - Eye worm

seqnames <- c("Q9NWB1", "Q17QD3", "Q95KI0", "A1A5R1", "Q10572", "E1G4K8")
choosebank("swissprot") #selects database for query
seqs <- list()
for (i in 1:length(seqnames)) {
  query <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(query)
}

alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)

Page 12: Practical on phylogenetic trees based on  sequence  alignments

Building a rooted phylogenetic tree (2)

library("ape")
mytree <- nj(mydist)
#labels must follow the order of the sequences in mydist
mytree$tip.label <- c("E1G4K8-Eye worm ", "Q10572-C.elegans(Root)", "A1A5R1-Rat",
                      "Q9NWB1-Human", "Q17QD3-Cow", "Q95KI0-Monkey")
myrootedtree <- root(mytree, outgroup = "Q10572-C.elegans(Root)", resolve.root = TRUE)
#printing myrootedtree reports:
#Phylogenetic tree with 6 tips and 5 internal nodes.
#Rooted; includes branch lengths.
plot.phylo(myrootedtree, edge.color = "blue", edge.width = 3, type = "p")

Page 13: Practical on phylogenetic trees based on  sequence  alignments

Rooted tree of FOX1 proteins

• The invertebrates are grouped together

• The worms form a distinct group, although the genetic distance between them is large

• Human FOX1 is closest to monkey and cow sequences

outgroup (worms)

Page 14: Practical on phylogenetic trees based on  sequence  alignments

Distance matrix

         E1G4K8  Q10572  A1A5R1  Q9NWB1  Q17QD3
Q10572   0.72
A1A5R1   0.75    0.63
Q9NWB1   0.72    0.62    0.44
Q17QD3   0.73    0.62    0.50    0.28
Q95KI0   0.73    0.61    0.49    0.28    0.14

• As expected, the eye worm is the most distantly related species to the vertebrates

• Cow and monkey have the closest relationship and the lowest genetic distance (0.14)

Table legend:
Q9NWB1 – Human
Q95KI0 – Monkey
Q10572 – Worm C.elegans (Root)
Q17QD3 – Cow
A1A5R1 – Rat
E1G4K8 – Eye worm
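The two observations about this matrix can be reproduced with a few lines of base R. This is a sketch that types in the values from the slide; the helper names (worm_to_vertebrates, nearest) are illustrative.

```r
ids <- c("E1G4K8", "Q10572", "A1A5R1", "Q9NWB1", "Q17QD3", "Q95KI0")
d <- matrix(0, 6, 6, dimnames = list(ids, ids))
# lower.tri() fills the lower triangle column by column
d[lower.tri(d)] <- c(0.72, 0.75, 0.72, 0.73, 0.73,  # vs E1G4K8 (eye worm)
                     0.63, 0.62, 0.62, 0.61,        # vs Q10572 (C. elegans)
                     0.44, 0.50, 0.49,              # vs A1A5R1 (rat)
                     0.28, 0.28,                    # vs Q9NWB1 (human)
                     0.14)                          # Q95KI0 vs Q17QD3
d <- d + t(d)

# Mean distance from each worm sequence to the four vertebrate sequences
vertebrates <- c("A1A5R1", "Q9NWB1", "Q17QD3", "Q95KI0")
worm_to_vertebrates <- rowMeans(d[c("E1G4K8", "Q10572"), vertebrates])

# Nearest neighbour of every sequence (ties resolve to the first match)
off <- d
diag(off) <- NA
nearest <- ids[apply(off, 1, which.min)]
names(nearest) <- ids

print(worm_to_vertebrates)
print(nearest)
```

Human is equally distant (0.28) from cow and monkey in this matrix, so its reported nearest neighbour depends only on the order of the columns.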

Page 15: Practical on phylogenetic trees based on  sequence  alignments

Rooted tree

• Time runs from left to right

• Monkey, Cow and Human have common ancestor 3

• Ancestor 1 is common to ancestors 2 and 3

TIME

Page 16: Practical on phylogenetic trees based on  sequence  alignments

Exercises on phylogenetic tree building

• Q1. Calculate the genetic distances between the following NS1 proteins from different Dengue virus strains: Dengue virus 1 NS1 protein (UniProt: Q9YRR4), Dengue virus 2 NS1 protein (UniProt: Q9YP96), Dengue virus 3 NS1 protein (UniProt: B0LSS3), and Dengue virus 4 NS1 protein (UniProt: Q6TFL5). Which viruses are the most closely related, and which are the least closely related, based on the genetic distances? Note: Dengue virus causes Dengue fever, which is classified by the WHO as a neglected tropical disease. There are four main types of Dengue virus: Dengue virus 1, Dengue virus 2, Dengue virus 3, and Dengue virus 4.

• Q2. Build an unrooted phylogenetic tree of the NS1 proteins from Dengue virus 1, Dengue virus 2, Dengue virus 3 and Dengue virus 4, using the neighbour-joining algorithm. Which are the most closely related proteins, based on the tree?

Page 17: Practical on phylogenetic trees based on  sequence  alignments

Exercises on phylogenetic tree building

• Q3. The Zika virus is related to Dengue viruses, but is not a Dengue virus, and can therefore be used as an outgroup in phylogenetic trees of Dengue virus sequences. UniProt accession Q32ZE1 consists of a sequence with similarity to the Dengue NS1 protein, so seems to be a related protein from Zika virus. Build a rooted phylogenetic tree of the Dengue NS1 proteins based on an alignment, using the Zika virus protein as the outgroup. Which are the most closely related Dengue virus proteins, based on the tree? What extra information does this tree tell you, compared to the unrooted tree in Q2?

Page 18: Practical on phylogenetic trees based on  sequence  alignments

Answers

Question 1:

Summary of viral proteins and UniProt accession numbers:
UniProt: Q9YRR4 – Dengue virus 1 NS1 protein
UniProt: Q9YP96 – Dengue virus 2 NS1 protein
UniProt: B0LSS3 – Dengue virus 3 NS1 protein
UniProt: Q6TFL5 – Dengue virus 4 NS1 protein

seqnames <- c("Q9YRR4", "Q9YP96", "B0LSS3", "Q6TFL5")
choosebank("swissprot") #selects database for query
seqs <- list()
for (i in 1:length(seqnames)) {
  query <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(query)
}

alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)
mydist

Page 19: Practical on phylogenetic trees based on  sequence  alignments

Answers

• Q1. The distance matrix is as follows

         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.306
Q9YP96   0.333   0.254
B0LSS3   0.297   0.230   0.227

The most distant are Q9YP96 (V2) and Q6TFL5 (V4), with a genetic distance of 0.333, while the most closely related are Q9YP96 (V2) and B0LSS3 (V3), with a genetic distance of 0.227

Page 20: Practical on phylogenetic trees based on  sequence  alignments

Answers

Question 2:

library("ape")
mytree <- nj(mydist)
#plotting unrooted tree based on alignment of whole protein sequences
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3, cex = 1.2, no.margin = TRUE, srt = 0)

#reduce gaps: trim each sequence to the region from the DMGY motif to the GEDG motif
seqs_trim <- seqs
for (i in 1:length(seqs)) {
  start_pos <- regexpr("DMGY", paste(seqs_trim[[i]], collapse = ""))[1]
  stop_pos <- regexpr("GEDG", paste(seqs_trim[[i]], collapse = ""))[1]
  seqs_trim[[i]] <- seqs_trim[[i]][start_pos:stop_pos]
}

alignment_ape <- multipleSeqAlignment(seqnames, seqs_trim)
mydist <- dist.alignment(alignment_ape)
mydist
library("ape")
mytree <- nj(mydist)
#plotting unrooted tree based on the best aligned portion
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3, cex = 1.2, no.margin = TRUE, srt = 0)
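The trimming step can be tried on its own without any database access. The sketch below is base R only; trim_between() is a hypothetical helper, and the toy sequence is made up for illustration. Unlike the slide code, it keeps the whole end motif and returns the sequence unchanged when a motif is not found (regexpr() returns -1 in that case), which is a design choice of this sketch rather than the slides' behaviour.

```r
# residues: character vector with one amino acid per element
trim_between <- function(residues, start_motif, end_motif) {
  s <- paste(residues, collapse = "")
  start_pos <- regexpr(start_motif, s, fixed = TRUE)[1]
  end_pos <- regexpr(end_motif, s, fixed = TRUE)[1]
  # Leave the sequence untouched if either motif is missing or out of order
  if (start_pos == -1 || end_pos == -1 || end_pos < start_pos) {
    return(residues)
  }
  residues[start_pos:(end_pos + nchar(end_motif) - 1)]
}

# Made-up toy sequence containing both motifs
toy <- strsplit("MKAHADMGYWIESKNQGEDGCWYGM", "")[[1]]
trimmed <- paste(trim_between(toy, "DMGY", "GEDG"), collapse = "")
print(trimmed)
```

Guarding against a missing motif matters here because an index range that starts or ends at -1 would silently produce a wrong subsequence.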

Page 21: Practical on phylogenetic trees based on  sequence  alignments

Answers

Question 2 (continued):

alignment_ape <- multipleSeqAlignment(seqnames, seqs_trim)
mydist <- dist.alignment(alignment_ape)
mydist
library("ape")
mytree <- nj(mydist)

#tree based on the best aligned portion
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3, cex = 1.2, no.margin = TRUE, srt = 0)

Page 22: Practical on phylogenetic trees based on  sequence  alignments

Answers

• The resulting Q2 un-rooted tree

This un-rooted tree agrees with the genetic distance matrix calculated in Q1. The tree suggests that B0LSS3 and Q9YP96 are the most closely related proteins. To improve the quality of the tree, it is best to select a region that has a minimal number of gaps between the protein sequences.

Below you can see that there are regions with lots of gaps. Let's build another tree based on the bolded (most conserved) region to see if it is the same.

Alignment of proteins (built using the full lengths of proteins):

Q6TFL5 DMGCVVSWNGKELKC…KDQKAVHADMGYWIESSKNQTWQIEKASLIEVKTCLWPKTHTL…GMEIRPLSEKEENMVKSQVTA
Q9YRR4 ------------------------DMGYWIESEKNETWKLARASFIEVKTCIWPKSHTL…GMEI-----------------
Q9YP96 DSGCVVSWKNKELKC…KDNRAVHADMGYWIESALNDTWKIEKASFIEVKNCHWPKSHTL…GMEIRPLKEKEENLVNSLVTA
B0LSS3 --------------------ASHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTL…------------------------

Page 23: Practical on phylogenetic trees based on  sequence  alignments

Answers

• The resulting tree looks the same, but we achieved overall better resolution between the proteins

Best aligned portion of protein sequences used (built using the bolded region):

         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.317
Q9YP96   0.317   0.264
B0LSS3   0.292   0.233   0.216

Whole protein sequences used:

         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.306
Q9YP96   0.332   0.254
B0LSS3   0.297   0.230   0.227
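The two matrices on this slide can be compared pair by pair in base R. The sketch assumes that the matrix labelled as built from the bolded region is the trimmed one and that the other comes from the whole proteins. Reading "better resolution" as a wider gap between the closest pair and the runner-up is one possible interpretation made for this example, not a definition from the slides.

```r
pairs <- c("Q9YRR4-Q6TFL5", "Q9YP96-Q6TFL5", "B0LSS3-Q6TFL5",
           "Q9YP96-Q9YRR4", "B0LSS3-Q9YRR4", "B0LSS3-Q9YP96")
# Distances typed in from the slide, in the same pair order
whole   <- c(0.306, 0.332, 0.297, 0.254, 0.230, 0.227)
trimmed <- c(0.317, 0.317, 0.292, 0.264, 0.233, 0.216)
names(whole) <- names(trimmed) <- pairs

# Does the closest pair stay the same after trimming?
closest_whole <- names(which.min(whole))
closest_trimmed <- names(which.min(trimmed))

# Gap between the smallest and second smallest distance
gap <- function(x) diff(sort(unname(x)))[1]

print(c(whole = closest_whole, trimmed = closest_trimmed))
print(c(whole = gap(whole), trimmed = gap(trimmed)))
```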

Page 24: Practical on phylogenetic trees based on  sequence  alignments

Answers

Question 3:

#Q3 building rooted tree based on Q89277 (yellow fever virus) as out group
library("seqinr")
library("muscle")
library("ape")
seqnames <- c("Q9YRR4", "Q9YP96", "B0LSS3", "Q6TFL5", "Q89277")
choosebank("swissprot") #selects database for query
seqs <- list()
for (i in 1:length(seqnames)) {
  query <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)
mydist

library("ape")
mytree <- nj(mydist)
myrootedtree <- root(mytree, outgroup = "Q89277", resolve.root = TRUE)
plot.phylo(myrootedtree, type = "p", edge.color = "blue", edge.width = 3, cex = 1.2, no.margin = TRUE, srt = 0)

Page 25: Practical on phylogenetic trees based on  sequence  alignments

Answers

• Q3: the rooted tree here is built with yellow fever virus (Q89277) as the outgroup, in place of the Zika virus protein (Q32ZE1) named in the question

• Most closely related viruses:
  – B0LSS3 and Q9YP96

• This rooted tree tells you which of the Dengue virus NS1 proteins branched off the earliest from the ancestors. The unrooted tree does not provide ancestry information (i.e. time sequence)

         Q89277  Q6TFL5  Q9YRR4  Q9YP96
Q6TFL5   0.523
Q9YRR4   0.511   0.306
Q9YP96   0.486   0.333   0.254
B0LSS3   0.487   0.297   0.230   0.227

outgroup
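The rooted tree can be rebuilt directly from the distance matrix on this slide, without querying the database or running the alignment. The sketch assumes the ape package is installed and uses the rounded values shown above, so branch lengths may differ slightly from the tree in the slides; nj(), root() and is.monophyletic() are ape functions, and the variable names are illustrative.

```r
ids <- c("Q89277", "Q6TFL5", "Q9YRR4", "Q9YP96", "B0LSS3")
d <- matrix(0, 5, 5, dimnames = list(ids, ids))
# lower.tri() fills the lower triangle column by column
d[lower.tri(d)] <- c(0.523, 0.511, 0.486, 0.487,  # vs Q89277 (outgroup)
                     0.306, 0.333, 0.297,         # vs Q6TFL5
                     0.254, 0.230,                # vs Q9YRR4
                     0.227)                       # B0LSS3 vs Q9YP96
d <- d + t(d)

if (requireNamespace("ape", quietly = TRUE)) {
  tree <- ape::nj(as.dist(d))
  rooted <- ape::root(tree, outgroup = "Q89277", resolve.root = TRUE)

  # Do Dengue 2 and Dengue 3 NS1 form a group of their own?
  sisters <- ape::is.monophyletic(rooted, c("Q9YP96", "B0LSS3"))
  # Do Dengue 1, 2 and 3 group together, leaving Dengue 4 outside?
  clade_123 <- ape::is.monophyletic(rooted, c("Q9YRR4", "Q9YP96", "B0LSS3"))

  print(c(sisters = sisters, clade_123 = clade_123))
}
```

Checking groupings with is.monophyletic() turns the visual reading of the plot into a yes/no answer, which is convenient when comparing trees built from different alignments.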

Page 26: Practical on phylogenetic trees based on  sequence  alignments

References

• Ape library for phylogenetic trees and ancestry with bootstrap methods http://cran.r-project.org/web/packages/ape/ape.pdf