Blast 2.0 Details
• The Filter Option:
  – the process of hiding regions of (nucleic acid or amino acid) sequence whose characteristics frequently lead to spurious high scores
  – typically involves removing (masking) repeated or low-complexity regions (LCRs)
  – The SEG program is used to mask or filter LCRs in amino acid queries.
  – The DUST program is used to mask or filter LCRs in nucleic acid queries.
  – More than half of the proteins in the database contain at least one low-complexity region.
SEG Filter Example
The default filtering option in BLAST 2.0 automatically replaces low-complexity regions with X's, which are visible in the query line of the alignments.
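The masking idea can be illustrated with a simplified sketch. This is not the SEG algorithm itself: it uses sliding-window Shannon entropy as a stand-in complexity measure, and the function names, window size and threshold are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue that falls in a low-entropy window with 'X'.

    `window` and `threshold` are illustrative assumptions, not SEG's
    actual parameters.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)
```

A homopolymer run such as a stretch of a single repeated residue has entropy 0 and is fully masked, while a window containing 12 different residues is left untouched. Sequences shorter than the window are returned unchanged.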
PSI-BLAST
• Position-Specific Iterated BLAST
• an automated, easy-to-use version of a "profile" search
  – a sensitive way to look for sequence homologues
• Intuition: substitution matrices should be specific to a particular site, e.g. penalize an alanine-to-glycine substitution more heavily in a helix.
PSI-BLAST: Outline
• Algorithm:
  – First perform a gapped BLAST database search.
  – PSI-BLAST uses information from significant alignments to construct a position-specific score matrix (PSSM).
  – The PSSM replaces the query sequence for the next round of database searching.
  – PSI-BLAST is iterated until no new significant alignments are found.
• Details:
  – Set initial thresholds high. Inspect each iteration's result for suspicious sequences.
  – Do several iterations (~5), or until no new sequences are found.
  – Even if only looking for a small set of sequences, make the initial search very broad:
    • First, use NR with up to 5 iterations to set the PSSM.
    • Then use that PSSM to search in the restricted domain.
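The control flow of the outline above can be sketched as a short loop. This is only a skeleton under stated assumptions: `search` and `build_pssm` are hypothetical callables supplied by the caller, standing in for the database search and the PSSM construction, and hits are represented simply as a set of sequence identifiers.

```python
from typing import Callable, Set, Tuple


def iterate_search(
    query: str,
    search: Callable[[object], Set[str]],
    build_pssm: Callable[[str, Set[str]], object],
    max_iterations: int = 5,
) -> Tuple[Set[str], int]:
    """Repeat search -> PSSM -> search until no new hits appear.

    `search` and `build_pssm` are hypothetical placeholders; they are
    not part of any real BLAST API.
    """
    model: object = query            # the first round uses the plain query
    hits: Set[str] = set()
    for iteration in range(1, max_iterations + 1):
        new_hits = search(model) - hits
        if not new_hits:             # converged: nothing new was found
            return hits, iteration
        hits |= new_hits
        model = build_pssm(query, hits)   # the PSSM replaces the query
    return hits, max_iterations
```

The function returns the accumulated hits together with the number of rounds run, so a caller can tell convergence apart from hitting the iteration cap. The manual inspection step recommended above (removing suspicious sequences before each rebuild) is deliberately left out of this sketch.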
PSI-BLAST: Details
• To calculate the profile for position 108, only the shaded regions are used.
• To calculate the profile at position i, pseudo-counts are used.
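As a rough illustration of why pseudo-counts matter, the sketch below estimates residue probabilities for one alignment column. It uses a flat additive pseudo-count, which is an assumption made for brevity; PSI-BLAST's own pseudo-count scheme is more elaborate. The function name and the toy alphabet in the usage note are likewise hypothetical.

```python
from collections import Counter
from typing import Dict


def column_probabilities(
    column: str, alphabet: str, pseudocount: float = 1.0
) -> Dict[str, float]:
    """Estimate per-residue probabilities for one alignment column.

    A flat pseudocount is added to every residue so that residues not
    observed in the column still get a non-zero probability. Characters
    outside `alphabet` (for example gap symbols) are ignored.
    """
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {aa: (counts[aa] + pseudocount) / total for aa in alphabet}
```

For example, `column_probabilities("AAAC", "ACDE")` assigns a small non-zero probability to D and E even though neither occurs in the column, which keeps a single unseen residue from driving a match score to zero.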
PSI-BLAST Caveats
• Good:
  – Increased ability to find distant homologues
  – If the sequences used to construct PSSMs are all homologous, the sensitivity at a given specificity improves significantly.
• Bad:
  – If non-homologous sequences are included in the PSSMs, they are “corrupted.” They then pull in more non-homologous sequences and become worse than generic substitution matrices.
• Advice:
  – Take special care to prevent non-homologous sequences from being included in the PSSM calculation.
    • When in doubt, leave it out!
    • Examine sequences with moderate similarity carefully.
  – Be particularly cautious about matches to sequences with highly biased amino acid content.
Database Homology Search
• Homology search
  – For genes/RNAs which do not encode proteins
    • nucleotide-level searches are relatively inefficient at identifying highly diverged sequences
  – For genes which encode proteins
    • protein-protein searches are significantly better
    • (two mRNA sequences might be only ~40% identical at the nucleotide level, but could be 70% similar in the proteins they encode)
• Rules of thumb:
  – 80% similarity implies same structure and function
  – highly diverged homologs could have down to 25% similarity
  – the "twilight zone" in the range of 20%: judgement about significant similarity is quite difficult
  – distantly related homologs may lack significant similarity
Database Homology Search
• E-values:
  – the expected number of sequences in the database which would achieve a given score by chance
  – are more useful than the raw or bit scores or percentage identity
  – An E-value of 0.001 is a standard threshold (unless the sequence is biased, e.g. low complexity).
  – E-values below 10^-50 are highly significant.
• Caveats with low E-values:
  – while the evolutionary relationship is highly likely, it does not necessarily imply identical function (multi-domain proteins)
  – if the E-value is extremely low AND the alignment covers the length of both sequences, then they would be expected to share related function
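To make the relationship between bit scores and E-values concrete, the sketch below uses the commonly quoted approximation E = m * n * 2^(-S'), where S' is the bit score and m and n are the effective query and database lengths. The slides do not state this formula, so treat it as background supplied here; the function names and the default threshold (taken from the 0.001 rule of thumb above) are illustrative.

```python
def expect_value(bit_score: float, query_len: float, db_len: float) -> float:
    """Approximate E-value: E = m * n * 2**(-S'), with S' the bit score.

    `query_len` (m) and `db_len` (n) are effective lengths; how they are
    derived from the real lengths is outside this sketch.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def passes_threshold(e_value: float, threshold: float = 1e-3) -> bool:
    """True if the E-value is at or below the chosen significance cutoff."""
    return e_value <= threshold
```

Because the E-value scales with the search space m * n, the same bit score is less significant against a larger database, which is one reason E-values are preferred over raw or bit scores when comparing searches.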
Profiles
• Rather than identifying only the “consensus” (i.e. most common) amino acid at a particular location, we can assign a probability to each amino acid in each position of the domain.
• Like a PSSM, but just for the domain.
    1    2    3
A   .1   .5   .25
C   .3   .1   .25
D   .2   .2   .25
E   .4   .2   .25
![Page 41: Blast 2.0 Details](https://reader030.vdocuments.us/reader030/viewer/2022033106/56814457550346895db0f28a/html5/thumbnails/41.jpg)
Applying a Profile
• Calculate score (probability of match) for a profile at each position in a sequence by multiplying individual probabilities.
• Use “Sliding window”:
• The probability can be transformed into a significance value, given an assumed random distribution.
    1    2    3
A   .1   .5   .25
C   .3   .1   .25
D   .2   .2   .25
E   .4   .2   .25

For sequence EACDC:
EAC = .4 * .5 * .25 = .05
ACD = .1 * .1 * .25 = .0025
CDC = .3 * .2 * .25 = .015
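The sliding-window calculation for EACDC can be reproduced with a few lines of Python. This is a minimal sketch: the profile values are the ones from the slide, while the names `PROFILE` and `window_scores` are introduced here purely for illustration.

```python
from math import prod
from typing import Dict, List

# Profile from the slide: probability of each residue at positions 1-3.
PROFILE: Dict[str, List[float]] = {
    "A": [0.1, 0.5, 0.25],
    "C": [0.3, 0.1, 0.25],
    "D": [0.2, 0.2, 0.25],
    "E": [0.4, 0.2, 0.25],
}


def window_scores(seq: str, profile: Dict[str, List[float]]) -> List[float]:
    """Score every window of the profile's width along the sequence.

    Each window score is the product of the per-position probabilities
    of the residues that fall in that window.
    """
    width = len(next(iter(profile.values())))
    return [
        prod(profile[res][pos] for pos, res in enumerate(seq[start:start + width]))
        for start in range(len(seq) - width + 1)
    ]
```

Calling `window_scores("EACDC", PROFILE)` yields one score per window (EAC, ACD, CDC), matching the hand calculation up to floating-point rounding. For longer profiles, summing log-probabilities instead of multiplying is a common way to avoid numerical underflow.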
Sequence Logos