introduction to blast with protein sequences
TRANSCRIPT
![Page 1: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/1.jpg)
1
Introduction to BLAST with
Protein Sequences
Utah State University – Spring 2014
STAT 5570: Statistical Bioinformatics
Notes 6.2
![Page 2: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/2.jpg)
2
References
Chapter 2 of Biological Sequence Analysis
(Durbin et al., 2001)
http://blast.ncbi.nlm.nih.gov/Blast.cgi
![Page 3: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/3.jpg)
3
Basic Local Alignment Search Tool
Finds regions of similarity between a query
sequence and a growing database
Breaks query & database into fragments:
(words)
Starts out by aligning fragments, then
extends alignment
Not optimal alignment, but a good heuristic
approachsimplified, rule of thumb
![Page 4: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/4.jpg)
4
Recall FASTA file from pairseqsim library
library(Biostrings)
f1 <- "C://folder//ex.fasta" # see slide 20 of Notes 6.1
ff <- readAAStringSet(f1, "fasta")
as.character(ff[1])
At1g01010 NAC domain protein, putative
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDA
MWYFFSRRENNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGR
YPDKTKSDWVIHEFHYDLLPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAG
SVVNQSRQRNSGSYNTYSEYDSANHGQQFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWL
SDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLVDERTSMQQHYSDHRPKKPVSGVLPD
DSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPLHNYKAQEQPKQQSKEK
VISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIIL
VG
![Page 5: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/5.jpg)
5
We’ll start here, with
protein BLAST:
(blastp)
Links to other
functions, including
bl2seq for pairwise
alignment (similar to
Biostrings, but less
useful output)
http://blast.ncbi.nlm.nih.gov/Blast.cgi
![Page 6: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/6.jpg)
6
paste sequence
here
(or accession /
gi number)
nr database is
most general
click here to start
search
for querying
specific
subsequences
set parameters (next slide)
![Page 7: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/7.jpg)
7
select scoring
scheme here
# “significant”
matches
expected just
by chance in
database
fragment sizes to
begin alignments
![Page 8: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/8.jpg)
8
Wait for Results
Large database to search- time also: “traffic”-dependent
This can be automated somewhat(write scripts to deliver & format results)
Can arrange for e-mail to link of final results
![Page 9: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/9.jpg)
9
Visualize
alignments,
colored by
score
Scroll down
or click on
alignments
to see more
information
most
interested in
top scoring
alignments
![Page 10: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/10.jpg)
10
We already knew
this one; probably
more interested in
other “good”
alignments
![Page 11: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/11.jpg)
11
Normalized alignment score
The expected # of sequences to score higher than this one in the
database search, just by chance
links to
gene
info
![Page 12: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/12.jpg)
12
Another “good” alignment
![Page 13: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/13.jpg)
Distribution of maximum score
13
Think of alignment score as sum of indep. random
variables.
Then the distribution of this score can be considered
approximately normal. (why?)
The asymptotic distribution of the maximum MN of a
series of N independent normal random variables is
the extreme value distribution:
Use this to calculate the probability that the best
match from a search of N unrelated sequences has
score greater than our observed largest score S.
x
N eNKMP exp
![Page 14: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/14.jpg)
14
E-value of a local, ungapped alignment
system scoringfor scale natural
size space searchfor scale natural
residues) in database, of length(or sequence database of length
sequencequery of length
where
,exp
: theis Sleast at of
scores withs HSP'ofnumber expected theThen
).alignments local ungapped scoring- top theof (one
pair"segment scoring-high"a be Let HSP
K
n
m
SnmKE
E-value
depend on qa and s(a,b)
2log
log
:is scorebit theaside,an As
KSS
![Page 15: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/15.jpg)
15
Statistical Significance of an Alignment
E
E
a
eXPXP
eE
EXP
a
EEaXP
EXEEPoissonX
X
1010
!0)exp(0
!)exp(
][ );(~
Sleast at score withs HSP'of #
0
Probability of observing a result at least as extreme as what was
seen, just by chance, when the sequences are all unrelated (Null)
= P-value
Still need to look at biological similarity
![Page 16: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/16.jpg)
16
Estimated values for λ & K
H = relative “entropy” of target
and background residue
frequencies
(like the average information
available per position to
distinguish an alignment from
chance)
![Page 17: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/17.jpg)
17
![Page 18: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/18.jpg)
18
Sub-tree
![Page 19: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/19.jpg)
19
Other resources to consider
Altschul et al. (1990) “Basic Local Alignment
Search Tool.” Journal of Molecular Biology
215:403-410.
Mitrophanov & Borodovsky (2006) “Statistical
significance in biological sequence analysis.”
Briefings in Bioinformatics 7(1):2-24
![Page 20: Introduction to BLAST with Protein Sequences](https://reader031.vdocuments.us/reader031/viewer/2022020911/6201ed85bcd4443c3b6e7cfa/html5/thumbnails/20.jpg)
20
Summary
BLAST
look at finding global alignment
from smaller, local alignments
E-value
expected number of better scores
just by chance,
when sequences are not related
Statistical significance ≠ biological relevance