phylogenetic trees • protein structure · phylogenetic analysis variable position conserved...
Post on 07-Aug-2020
Prof. Bystroff talks about BIOINFORMATICS
• Sequence database searching • Phylogenetic Trees • Protein Structure
AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC
AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
Protein sequence alignment uses a "substitution matrix".
[Figure: grid of substitution scores with Sequence 1 along one axis and Sequence 2 along the other]
Find the best pathway through the substitution scores, and you have an alignment.
This is the "dynamic programming" algorithm.
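The best-pathway idea can be sketched in a few lines of Python. This is a minimal global-alignment sketch (Needleman-Wunsch style), not the scoring used on the slide: the match, mismatch and gap values are assumptions standing in for a real substitution matrix, and the function name `global_align` is made up for illustration.

```python
def global_align(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a toy scoring scheme."""
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in seq2
                              score[i][j - 1] + gap)      # gap in seq1
    # Trace the best pathway back from the bottom-right corner.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out1.append(seq1[i - 1])
                out2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out1)), "".join(reversed(out2))


best, top, bottom = global_align("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

For protein sequences, the `match`/`mismatch` lookup would be replaced by a lookup in a substitution matrix; the table-filling and traceback stay the same.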
BLAST searches millions of sequences
GenBank contains over 162 million sequences!!
The score for each should be the optimal alignment score. Even if we can do 1 per millisecond, it would take 45 hours to do one search. BLAST usually finishes in under a minute.
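The 45-hour figure follows directly from the numbers on the slide. A quick check, assuming exactly one millisecond per optimal alignment:

```python
# Back-of-the-envelope check of the slide's estimate.
n_sequences = 162_000_000   # database size quoted on the slide
ms_per_alignment = 1        # assumed cost of one optimal alignment
total_ms = n_sequences * ms_per_alignment
hours = total_ms / 1000 / 3600
print(f"{hours:.0f} hours per search")
```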
How does BLAST do it so fast?
BLAST precalculates all triplet hits in the database.
My sequence has this triplet: PGQ
BLAST uses an expansion table to allow for near-perfect matches: PGQ PGR PGS ... PGT PGV PGW PGY PAQ PCQ PDQ PEQ PFQ ...
BLAST saves a lookup table (called an INDEX) of all the near-identity triplet locations in the whole database.
This is all done when BLAST is set up, before any searches are carried out.
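The index idea can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the database is a made-up two-protein dictionary, "near identity" is taken to mean at most one substituted residue (real BLAST decides neighbors by substitution score), and the expansion is done at lookup time rather than stored in the index. The function names are hypothetical.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_index(database, k=3):
    """Map every k-residue word to the (protein id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for prot_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((prot_id, pos))
    return index


def neighbors(word):
    """The word itself plus every word that differs at exactly one position."""
    words = {word}
    for i in range(len(word)):
        for aa in AMINO_ACIDS:
            words.add(word[:i] + aa + word[i + 1:])
    return words


def lookup(index, query, k=3):
    """Return (query offset, protein id, database offset) for every triplet hit."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for word in neighbors(query[qpos:qpos + k]):
            for prot_id, dpos in index.get(word, []):
                hits.append((qpos, prot_id, dpos))
    return hits


# Hypothetical toy database; the index is built once, before any search.
database = {"prot1": "MKPGRLV", "prot2": "AAPGQAA"}
index = build_index(database)
print(sorted(lookup(index, "PGQ")))
```

Building the index is the slow part, which is why it pays to do it once at setup: each search then costs only a handful of dictionary lookups per query triplet.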
BLAST finds diagonal arrangements of triplet hits
triplet hits in one database protein
Hits are joined by extension
BLAST scores only the best hits (saves time)
BLAST connects the diagonals (FASTA algorithm)
This protein is given a score, and we save it for later only if the score passes a cutoff.
Re-scoring.
Convert score to an e-value*.
Rank by e-value.
*later...
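The "diagonal arrangements of triplet hits" step above can be sketched as follows. Two hits lie on the same diagonal when the difference between the database position and the query position is the same. This sketch only counts hits per diagonal and applies a cutoff to the count; it does not extend hits or join diagonals, and the function names and the cutoff value are assumptions for illustration.

```python
from collections import Counter


def best_diagonal(hits):
    """hits: (query_pos, db_pos) pairs for one database protein.

    Returns (diagonal offset, number of hits on it), or None if there are no hits.
    """
    if not hits:
        return None
    counts = Counter(db_pos - query_pos for query_pos, db_pos in hits)
    return counts.most_common(1)[0]


def passes_cutoff(hits, cutoff=2):
    """Keep a protein for later scoring only if its best diagonal has enough hits."""
    best = best_diagonal(hits)
    return best is not None and best[1] >= cutoff


# Three hits share the diagonal db_pos - query_pos = 5; one stray hit does not.
hits = [(0, 5), (3, 8), (6, 11), (2, 20)]
print(best_diagonal(hits))
print(passes_cutoff(hits, cutoff=3))
```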
Protein Databases available for BLAST search
Go to a BLAST search page (e.g. blastp), select a database to search, and then select ? to learn a little about that database.
Forms of BLAST
BLAST       query             database
blastn      nucleotide        nucleotide
blastp      protein           protein
tblastn     protein           translated DNA
blastx      translated DNA    protein
tblastx     translated DNA    translated DNA
psi-blast   protein, profile  protein
phi-blast   pattern           protein
How significant is that?
Please give me a number for how likely the data would not have been the result of chance, as opposed to a specific inference.
e-value: a better metric of significance.
E-value = p-value x (number of attempts)
Scores from random alignments are used to calculate the p-value of an alignment score
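[Figure: histogram of random alignment scores; horizontal axis: score, vertical axis: freq]
$$p\text{-value}(x) = \int_{x}^{\infty} f(s)\,ds$$
where f is the normalized normal distribution fit to the random scores.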
p-value is the significance of one (1) alignment score.
e-value is the significance of one score of many tries.
Searching a database of 162 million sequences for one hit is like trying 162 million times to get one good alignment. The number of times you will see that score by chance is the p-value times 162 million!
e-value = p-value * 162,000,000 (GenBank search)
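The two definitions above can be put into a short Python sketch. It follows the slide's description (a normal distribution fit to scores from random alignments) using the standard-library `statistics.NormalDist`; the random scores below are made-up numbers, and the function names are hypothetical.

```python
from statistics import NormalDist


def p_value(score, random_scores):
    """Area of the fitted normal distribution at or above the observed score."""
    fit = NormalDist.from_samples(random_scores)
    return 1.0 - fit.cdf(score)


def e_value(p, n_attempts):
    """Expected number of times a score this good turns up by chance."""
    return p * n_attempts


# Made-up scores from random alignments (mean 14).
random_scores = [10, 12, 14, 16, 18]
p = p_value(25, random_scores)
print(p)
print(e_value(p, 162_000_000))  # database size quoted on the slide
```

A p-value that looks tiny for a single alignment can still give an e-value well above 1 once it is multiplied by the number of sequences searched.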
Pop-quiz
BLAST HIT .................... e-value
 1. annotation   3.0
 2. annotation   3.0
 3. annotation   3.0
 4. annotation   3.0
 5. annotation   3.0
 6. annotation   3.0
 7. annotation   3.0
 8. annotation   3.0
 9. annotation   3.0
10. annotation   3.0
How many of the above 10 hits are expected to be by chance?
Pop-quiz
BLAST HIT .................... e-value
 1. annotation   1.0
 2. annotation   2.0
 3. annotation   3.0
 4. annotation   4.0
 5. annotation   5.0
 6. annotation   6.0
 7. annotation   7.0
 8. annotation   8.0
 9. annotation   9.0
10. annotation   10.0
How many of the above 10 hits are expected to be by chance?
Pop-quiz
BLAST HIT .................... e-value
 1. annotation   0.0
 2. annotation   0.01
 3. annotation   0.01
 4. annotation   0.01
 5. annotation   0.02
 6. annotation   0.02
 7. annotation   0.02
 8. annotation   0.02
 9. annotation   0.02
10. annotation   10.0
How many of the above 10 hits are expected to be by chance?
Bioinformatics
• Sequence database searching • Phylogenetic Trees • Protein Structure
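Evolutionary time
[Figure: one tree of taxa A, B, C and D drawn three ways; branch labels 1, 1, 1, 6, 3, 5]
Cladogram: branch lengths have no meaning.
Phylogram: branch lengths show genetic change.
Ultrametric tree: branch lengths show time.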
(D:5,(A:1,(C:1,B:6):1):3)
Parenthesis (Newick) notation has both labels and distances.
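To see how the labels and distances are encoded, here is a minimal Newick reader in Python. It is a sketch for simple strings like the one above, not a full parser: it keeps quotes around quoted names and ignores comments and other extensions. The function names are made up for illustration.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]            # empty for unlabeled internal nodes
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return node()


def leaf_distances(tree, so_far=0.0):
    """Sum branch lengths from the root down to every leaf."""
    name, length, children = tree
    total = so_far + length
    if not children:
        return {name: total}
    distances = {}
    for child in children:
        distances.update(leaf_distances(child, total))
    return distances


tree = parse_newick("(D:5,(A:1,(C:1,B:6):1):3)")
print(leaf_distances(tree))
```

Each pair of parentheses is one internal node, each comma separates sibling branches, and the number after a colon is the length of the branch leading to that node.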
A multiple sequence alignment is made using many pairwise sequence alignments
Multiple Sequence Alignment
Construct a distance-based tree
[Figure: triangular table of pairwise distances among A, B, C, D, E and F. Values as listed on the slide: 97, 81, 77, 82, 59, 32, 80, 55, 31, 90, 65, 40, 61, 42, 33.]
Draw tree here.
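One common way to turn a distance table into a tree is UPGMA: repeatedly join the two closest clusters and average their distances to everything else. The slide does not say which method is intended, so treat this as one possible sketch. The four-taxon distances below are made up for illustration (they are not the values from the slide), and the output gives the tree topology only, without branch lengths.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology as a Newick string.

    names: list of taxon labels
    dist:  dict mapping (label, label) pairs to distances
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {name: 1 for name in names}      # leaves per cluster
    while len(size) > 1:
        # Join the closest pair of clusters.
        a, b = sorted(min(d, key=d.get))
        merged = f"({a},{b})"
        merged_size = size[a] + size[b]
        # Distance from the new cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in size:
            if c in (a, b):
                continue
            da = d[frozenset((a, c))]
            db = d[frozenset((b, c))]
            d[frozenset((merged, c))] = (size[a] * da + size[b] * db) / merged_size
        # Drop every distance that involves the two old clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        del size[a], size[b]
        size[merged] = merged_size
    return next(iter(size)) + ";"


# Hypothetical distances for four taxa.
toy = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
print(upgma(["A", "B", "C", "D"], toy))
```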
Life is not strictly a tree -- horizontal gene transfer
BF Smets, T Barkay (2005) “Horizontal gene transfer: perspectives at a crossroads of scientific disciplines” Nature Reviews Microbiology.
Discrete Steps Needed for Stability of Gene Transfer
Stably incorporating horizontally transferred genes into a recipient genome involves five distinct steps (Fig. 1).
1. First, a particular segment of DNA or RNA is prepared for transfer from the donor strain through one of several processes, including excision and circularization of conjugative transposons, initiation of conjugal plasmid transfer by synthesis of a mating pair-formation protein complex, or packaging of nucleic acids into phage virions.
2. Next, the segment is transferred either by conjugation, which requires contact between the donor and recipient cells, or by transformation and transduction without direct contact.
3. During the third step, genetic material enters the recipient cell, where cell exclusion may abort the transfer.
4. Otherwise, during the fourth step, the incoming gene is integrated into the recipient genome by legitimate or site-specific recombination or by plasmid circularization and complementary strand synthesis. Barriers to transfer during this step come from restriction modification systems, failure to integrate and replicate within the new host genome, and incompatibility with resident plasmids.
5. In the final step, transferred genes are replicated as part of the recipient genome and transmitted to daughter cells in stable fashion over successive generations.
Researchers from different disciplines tend to focus on specific stages within this five-step sequence. Thus, evolutionary biologists who examine microbial genomes for evidence of past transfers tend to look at HGTs from the perspective of step five. Molecular biologists are more likely to examine the details of the transfer events, while microbial ecologists look more broadly when they describe the magnitude and diversity of the mobile gene pool, sometimes called the mobilome.
Sequence homology trees are complicated by paralogy
Orthologs: homologs originating from a speciation event.
Paralogs: homologs originating from a gene duplication event.
[Figure: the true species tree (clam, crab, fish, duck) next to a sequence tree of gene copies (clam A, crab B, fish A, fish B, duck A, duck B). The reconciled trees mark duplication, speciation and gene loss events.]
Use orthologs
• To make the right inferences about evolution, make sure your phylogenetic tree is composed of orthologs.
How do you know it's an ortholog?
1. It has the same function in both species.
2. It has about the same number of differences across species as other orthologs.
3. You don't.
Functional inference from multiple sequence alignments
Conserved / Not conserved
• folding
• function
• stability
• kinetics
• enzyme activity
• binding
• post-translational modification
• species differences
Next time:
• Visit rcsb.org
• Try visualizing a protein.
• Locate a residue that is conserved across all species in a BLAST search.
• Locate one that is conserved except in one species. What might be its function?
Positions marked in the figure: Mm1:C362, Mm1:S393, Mm1:C499, Mm2:C210, Mm3:C96, Mm3:C197, Mm4:C114, Mm4:C170, Mm4:C233
Phylogenetic analysis
variable-position conserved
single-position conserved
A transmembrane Cys can still form a self-reacting disulfide (SS) when it mutates to a new position. Therefore, variable-position conserved Cys are self-reacting. Single-position conserved Cys are cross-reacting.
Positions marked in the figure: Mm1:C362, Mm1:S393
In this case, the mammals found to be missing conserved cysteines in the sperm-specific calcium channel CatSper were species that lacked sperm competition.
http://etetoolkit.org
Format of sequence alignment for ETE tree
>Squirrel
FLVVCLNT---CIFLCIYV---LTLMFTCLF---LLRICRVLR---VSICTSEFA---LGFCLFGI---LTILVCEV---LVHVCMAV---ICITQDGW
>Beaver
FVTVCLNT---CIFLCIYV---LILMFTCMF---LLRICRVLR---VSICTSEFF---LGFCLFGI---LTILICEV---LVHVCMAV---ICITQDGW
>Blind mole rat
FLVVCLNT---SIFLCIYI---LTLMFTCLF---LLRICRVLK---VSTYACEFF---LGFCLFGV---LTILTCEV---LVHVCMAV---ICITQDGW
>Mouse
FIVVCLNT---SIFLSIYV---LTLMFTCLF---LLRVCRVLR---VSVYVCEFL---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW
>Pika
FLVICLNT---CIFLSIYV---LTLMFTCLF---LLRICRVLR---VSIYASEFS---LGFCLFGT---LTILICEV---LLHVCMSV---ICITQDGW
>Rabbit
FLVVCLNT---CIFLCIYM---FVLMFTCLF---LLRICRVLR---VSIYASEFS---LGFCLFGA---LTILFCEV---LLHVCMAV---ICITQDGW
>Gibbon
FFVVCLNT---SIFFCIYV---LILMFTCLF---LLRICRVLR---VSICTSELF---LGFCLFGS---LTILICEV---LVHVCMAV---ICITQDGW
>Monkey
FFIVCLNT---SIFFCIYV---LILMFTCLF---FLRICRVLR---VSICTSELA---LGFCLFGS---LTILICEV---LVHVCMAV---ICITQDGW
>Bushbaby
FFIICLNT---CIFFCIYV---LILMFTCLF---FLRICRVLR---VGIYSAEFY---LGFCLFGV---LSILVCEV---LIHVCMAV---ICITQDGW
>Lemur
FFIICLNT---AIFFSIYL---LILMFTCLF---FLRICRVLR---VSIYSSEFV---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW
>Sifaka
FFVICLNT---SIFFCIYV---LILMFTCLF---FLRICRVLR---VSIYSSEFS---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW
FASTA format. Output by most alignment programs and packages.
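A FASTA alignment like the one above is easy to read with plain Python, and once it is loaded, conserved positions are simply the columns where every sequence has the same residue. The sketch below uses only the first eight-residue block of three of the sequences above to stay short; the function names are hypothetical, and ETE itself is not used here.

```python
def parse_fasta(text):
    """Read FASTA text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line      # sequences may span several lines
    return records


def conserved_columns(records):
    """0-based alignment columns where all sequences share one residue (gaps excluded)."""
    columns = zip(*records.values())
    return [i for i, col in enumerate(columns)
            if len(set(col)) == 1 and col[0] != "-"]


example = """>Squirrel
FLVVCLNT
>Beaver
FVTVCLNT
>Mouse
FIVVCLNT
"""
alignment = parse_fasta(example)
print(conserved_columns(alignment))
```

In this small block the cysteine in column 4 is conserved, while columns 1 and 2 vary between species.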
Format of tree for ETE tree
(Beaver:0.106861,(('Blind mole rat':0.0870003,Mouse:0.128141):0.0287991,('Naked mole rat':0.316691,((Pika:0.0584227,Rabbit:0.0514835):0.0419969,(((Gibbon:0.062089,Monkey:0.0723263):0.0501071,(Bushbaby:0.104971,(Lemur:0.0853643,Sifaka:0.0510973):0.0395091):0.00449631):0.0111712,((Marsupials:0.390453,((Manatee:0.0517099,Mole:0.11989):0.00669083,'Elephant shrew':0.143216):0.0258566):0.00193119,('Star-nosed mole':0.111938,((Alpaca:0.102393,Pig:0.0618056):0.010159,(Leopard:0.0585696,('Brown bat':0.108369,('Fruit bat':0.0725375,('Horseshoe bat':0.0640651,'Leaf-nosed bat':0.0497925):0.0219586):0.00813098):0.00277678):0.00371913):0.00419253):0.00732498):0.0142435):0.00485045):0.00731201):0.00875856):0.00305836,Squirrel:0.0786295);
Newick format. Output by UGENE, NCBI, and most tree tools.
Protein Data Bank
• rcsb.org
• 4ms2, a voltage-gated calcium channel.
1) Visualize overall structure in NGL.
2) View ligands.
3) View electron density.
4) Find an amino acid. Zoom in.
5) Homology modeling.
superposed homologs
Homology modeling in a nutshell.
ALIGNMENT
target    ACDEFG....HIKLMNPQRSTVWY
           ||:||    || :|  | ||||:
template  .CDDFGACDGHIYIM..QQSTVWF

Modeling action...
• Add Ala to the N-terminal Cys using energy minimization.
• Keep the conserved Phe sidechain and backbone.
• Cut out the four-residue insertion and connect G to H.
• Switch non-similar sidechains Y->K. Possibly move backbone.
• Cut at M-Q, insert two residues, Asn-Pro.
• Switch similar sidechains F->Y. Keep backbone fixed.
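The modeling actions above can be read straight off the alignment, column by column. The Python sketch below does only that bookkeeping, using the target and template strings from the slide with "." as the gap character. It does not build or minimize any structure, and the function name and action wording are made up for illustration.

```python
from collections import Counter


def modeling_actions(target, template, gap="."):
    """List what must happen to the template at each alignment column."""
    actions = []
    for t, m in zip(target, template):
        if t == gap and m == gap:
            continue
        if m == gap:
            actions.append(f"insert {t}")       # residue missing from the template
        elif t == gap:
            actions.append(f"delete {m}")       # template residue not in the target
        elif t == m:
            actions.append(f"keep {m}")         # conserved: keep sidechain and backbone
        else:
            actions.append(f"mutate {m}->{t}")  # switch the sidechain
    return actions


target = "ACDEFG....HIKLMNPQRSTVWY"
template = ".CDDFGACDGHIYIM..QQSTVWF"
actions = modeling_actions(target, template)
print(actions)
print(Counter(a.split()[0] for a in actions))
```

The four consecutive "delete" actions correspond to cutting out the four-residue insertion, and the two "insert" actions after M correspond to adding Asn-Pro.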
Automatic homology modeling by SWISS-MODEL
https://swissmodel.expasy.org/interactive
Next time:
• Read CatSper paper: Bystroff, C. (2018). Intramembranal disulfide cross-linking elucidates the super-quaternary structure of mammalian CatSpers. Reproductive Biology, 18(1), 76-82.
Read at least one part of the paper in detail. Bring a comment, suggestion, or question to class 2/19.