Post on 21-Dec-2015
BLAST
What is BLAST?
“BLAST® (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.
The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships.
The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits.
BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity” (Altschul et al., 1990).
Selecting BLAST programme
blastp: Compares an amino acid query sequence against a protein sequence database.
blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
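The "translated in all reading frames" step used by blastx, tblastn and tblastx can be sketched as below. This is only an illustration of six-frame translation with the standard genetic code, not BLAST's own implementation; the function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; codons with unknown bases become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])      # forward strand
        frames[-(offset + 1)] = translate(rc[offset:])    # reverse strand
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(frame, protein)
```

A stop codon is shown as "*", so a long stretch without "*" in one frame hints at a potential translation product.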
Selecting the Database (protein)
nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR.
month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR released in the last 30 days.
swissprot: The last major release of the SWISS-PROT protein sequence database (no updates). These are uploaded to our system when they are received from EMBL.
patents: Protein sequences derived from the Patent division of GenBank.
yeast: Yeast (Saccharomyces cerevisiae) protein sequences. This database is not a listing of all yeast protein sequences; it contains the protein translations of the complete yeast genome.
E. coli: E. coli (Escherichia coli) genomic CDS translations.
pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank.
alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
Nucleotide Databases
nr: All non-redundant GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or HTGS sequences).
month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
dbest: Non-redundant database of GenBank + EMBL + DDBJ EST Divisions.
dbsts: Non-redundant database of GenBank + EMBL + DDBJ STS Divisions.
mouse ests: The non-redundant database of GenBank + EMBL + DDBJ EST Divisions limited to the organism mouse.
human ests: The non-redundant database of GenBank + EMBL + DDBJ EST Divisions limited to the organism human.
other ests: The non-redundant database of GenBank + EMBL + DDBJ EST Divisions for all organisms except mouse and human.
yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of all yeast nucleotide sequences, but the sequence fragments from the complete yeast genome.
E. coli: E. coli (Escherichia coli) genomic nucleotide sequences.
Entering your Sequence
The BLAST web pages accept input sequences in three formats: FASTA sequence format, NCBI accession numbers, or GIs.
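For readers unfamiliar with FASTA, a record is a header line starting with ">" followed by one or more lines of sequence. A minimal parsing sketch follows; the function name is hypothetical and the parser is deliberately simple (it does not validate residue letters).

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = ">seq1 demo\nACGT\nTTGA\n>seq2\nMKV\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```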
Setting Up a Query
Question: Is the query sequence represented in the database?
Database: Choose a current nucleic acid database. Select from among organism-specific (e.g., yeast), inclusive (e.g., nonredundant), or specialized (e.g., dbEST, dbSTS, GSS, HTG) databases.
Program: blastn
Question: Are there homologs or evolutionary relatives of the query sequence in the database? Are there proteins whose function is related to the query sequence?
Database: Choose a protein database if the query is protein, or DNA expected to encode a protein, because amino acid searches are more sensitive.
Program: blastp for amino acid queries; blastx for translated nucleic acid queries. Use tblastn or tblastx for comparisons of an amino acid or translated nucleic acid query versus a translated nucleic acid database.
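The program choice above reduces to a small lookup on query type and database type. The helper below is a hypothetical sketch of that decision, not part of any BLAST distribution.

```python
def choose_blast_program(query_type, database_type, translated=False):
    """Pick a BLAST program name from the query and database types.

    query_type and database_type are "protein" or "nucleotide".
    translated only matters for nucleotide-vs-nucleotide searches:
    True selects tblastx (compare six-frame translations), False blastn.
    """
    valid = {"protein", "nucleotide"}
    if query_type not in valid or database_type not in valid:
        raise ValueError("types must be 'protein' or 'nucleotide'")
    if query_type == "protein":
        return "blastp" if database_type == "protein" else "tblastn"
    if database_type == "protein":
        return "blastx"
    return "tblastx" if translated else "blastn"


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```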
Search parameters
(Short Query, Large Sequence Family and Ungapped BLAST are special cases.)

Parameter        Default    Short Query              Large Sequence Family   Ungapped BLAST
Filter           on         off                      on                      on
Scoring Matrix   BLOSUM62   PAM30 for 35 and under   BLOSUM62                BLOSUM62
Word Size        3          3, or reduce to 2        3                       3
E value          10         1000 or more             10                      10
Gap costs        11,1       11,1                     11,1                    4
Alignments       50         50                       2000                    50
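The parameter table can be read as a set of presets for protein searches. The sketch below encodes a few of its rows. The function and dictionary keys are hypothetical (they are not BLAST option names), and treating "35 and under" as the definition of a short query is an assumption.

```python
def suggest_protein_settings(query_length, large_family=False):
    """Return illustrative search settings based on the table above."""
    settings = {
        "filter": True,        # low-complexity filtering on by default
        "matrix": "BLOSUM62",
        "word_size": 3,
        "expect": 10,
        "alignments": 50,
    }
    if query_length <= 35:     # assumed cut-off for a "short query"
        settings["filter"] = False
        settings["matrix"] = "PAM30"
        settings["expect"] = 1000   # table says "1000 or more"
        # word_size stays 3; the table notes it may be reduced to 2
    if large_family:
        settings["alignments"] = 2000
    return settings


if __name__ == "__main__":
    print(suggest_protein_settings(20))
```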
Filter
The default setting will filter repetitive or low-complexity sequences from the query using the SEG (protein) or DUST (nucleic acid) programs.
If a low-complexity region in the query is of interest, filtering will need to be turned off.
If the number of hits returned is small when searching with a short query, it may help to re-search with filtering turned off.
The Human repeat filter option masks human repeats such as LINEs and SINEs and is especially useful for human sequences that may contain these repeats.
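To give a feel for what low-complexity filtering does, the sketch below masks windows whose Shannon entropy is low. This is not the SEG or DUST algorithm; the window size, threshold and function names are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of a string."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace residues in low-entropy windows with mask_char.

    Every window of the given length is scored; if its entropy is below
    the threshold, all positions it covers are masked.
    """
    masked = [False] * len(seq)
    for start in range(0, len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(mask_char if m else ch for ch, m in zip(seq, masked))


if __name__ == "__main__":
    query = "ACDEFGHIKLMNPQRSTVWY" + "S" * 15 + "ACDEFGHIKLMNPQRSTVWY"
    print(mask_low_complexity(query))
```

Masked residues no longer seed spurious hits, which is why a query whose region of interest is itself low-complexity needs filtering turned off.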
Scoring Matrices
BLOSUM62 is the default matrix. The BLOSUM matrix assigns a probability score for each position in an alignment that is based on the frequency with which that substitution is known to occur within conserved blocks of related proteins.
BLOSUM62 has been empirically shown to be among the best matrices for detecting weak protein similarities.
Other supported options include PAM30, PAM70, BLOSUM80, and BLOSUM45.
Short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices may be used instead.
No alternate scoring matrices are available for blastn.
Gap opening and gap extension penalties
BLAST program   Default Gap Penalty (G)   Default Gap Extension Penalty (E)   Other supported (G) and (E) values
blastp          -11                       -1                                  -10, -1; -10, -2; -11, -1; -8, -2; -9, -2
blastn          -5                        -2                                  none
E value threshold
The E value for an alignment score "S" represents the number of hits with a score equal to or better than "S" that would be "expected" by chance (the background noise) when searching a database of a particular size.
The default E value for blastn, blastp, blastx and tblastn is 10. At this setting, 10 hits with scores equal to or better than the defined alignment score, S, are expected to occur by chance (in a search of the database using a random query of similar length).
Increase the E value to 1000 or more when searching with a short query, since it is likely to be found many times by chance in a given database.
Alignments
If the number of alignments requested (x) is smaller than the number of hits exceeding the significance threshold, only the top (x) hits will be reported.
To detect low-similarity matches, the number of alignments to be shown should be increased when searching with a member of a large sequence family.
Analyzing the output
Step 1. Examine the alignment scores and statistics
Scores for each position of an alignment are derived from a substitution matrix.
The raw score "S" of the alignment is usually calculated by summing the scores for each letter-to-letter and letter-to-null position in the alignment.
The bit score is calculated from the raw score by normalizing with the statistical variables that define a given scoring system. Therefore, bit scores from different alignments, even those employing different scoring matrices, can be compared.
The higher the score, the better the alignment.
There is no widely accepted theory for selecting gap costs.
It is rarely necessary to change gap opening or extension values from the default.
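The raw-score rule above (sum letter-to-letter scores and letter-to-null penalties) can be sketched for a nucleotide alignment as follows. The gap defaults follow the blastn row of the gap penalty table (5 and 2). The match and mismatch values (+1 and -3) and the convention that a gap of length k costs gap_open + k * gap_extend are assumptions made for this example.

```python
def raw_score(aligned_query, aligned_subject,
              match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Sum per-column scores of a pairwise alignment; '-' marks a gap."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_side = None  # which sequence currently holds an open gap, if any
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" and s == "-":
            raise ValueError("column with two gaps is not a valid alignment")
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != gap_side:
                score -= gap_open      # charged once per gap
                gap_side = side
            score -= gap_extend        # charged for every gap position
        else:
            gap_side = None
            score += match if q == s else mismatch
    return score


if __name__ == "__main__":
    print(raw_score("ACGTACGT", "ACG-ACTT"))
```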
Statistics
Local alignments with no gaps are referred to as high-scoring segment pairs (HSPs).
For gapped alignments, the significance of a given alignment with score S is represented by the E (Expect) value (shown in the right-most column in the output), the expected number of chance alignments with a score of S or better.
The E value decreases exponentially as the Score (S) that is assigned to a match between two sequences increases.
A convenient way to create a significance threshold for reporting hits is to alter the E value. When the Expect value threshold is increased from the default value of 10, more hits can be reported.
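The exponential relationship between score and E value is usually written in the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The sketch below assumes that form. The values of lambda and K in the example are illustrative placeholders, not taken from the slides; real searches also use effective rather than raw lengths.

```python
import math


def expect_value(raw, query_len, db_len, lam, k):
    """E = K * m * n * exp(-lambda * S) for raw score S."""
    return k * query_len * db_len * math.exp(-lam * raw)


def bit_score(raw, lam, k):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_bits(bits, query_len, db_len):
    """Equivalent E value computed from the bit score: m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.267, 0.041          # placeholder statistical parameters
    m, n = 250, 5_000_000          # hypothetical query and database lengths
    for s in (40, 60, 80):
        print(s, round(bit_score(s, lam, k), 1), expect_value(s, m, n, lam, k))
```

Because the bit score already folds in lambda and K, the E value depends only on the bit score and the search space size, which is why bit scores from different scoring systems can be compared.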
Step 2. Examine the alignments
Descriptions
The highest scoring alignments are described by one line summaries called "descriptions".
The description lines are sorted by increasing E value, thus the most significant alignments (lowest E values) are at the top.
Graphic Representation
At the top is a linear map of the query. Each bar drawn below the map represents a protein (or protein fragment) that matches the query sequence. The position of each bar relative to the linear map of the query allows the user to see instantly the extent to which the database matches align with a single or multiple regions of the query.
The most similar hits are shown at the top in red. Pink, green, blue and black bars follow, representing proteins in decreasing order of similarity.
PSI-BLAST
Position-Specific Iterated (PSI) BLAST is a program based on the BLAST 2.0 algorithm that is designed to detect weak relationships between the query and members of the database not necessarily detectable by standard BLAST searches.
The added sensitivity of this program over regular BLAST comes from the use of a profile that is constructed (automatically) from a multiple alignment of the highest scoring hits in the initial BLAST search.
A highly conserved position will receive a high score and weakly conserved positions receive scores near zero.
The profile is then used to perform additional BLAST searches (called iterations), and the results of each iteration are used to refine the profile.
PSI-BLAST analysis is useful both for identifying the distant members of a protein family, whose relationship is not recognizable by straight sequence comparison, and also for deducing the function of hypothetical proteins that are unannotated in the database.
A PSI-BLAST query is identical to a BLAST query with added specification by the use of the expectation (E) value cut-off for inclusion of a match in the first and subsequent iterations.
The initial PSI-BLAST search uses the same matrix options available for Gapped BLAST, since it is a Gapped BLAST search.
The user can continue to search iteratively until satisfied that no new matches will be identified. The point at which no new hits are identified by additional searches is known as "convergence".
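The iterate-until-convergence idea can be sketched as a loop. Here `search` is a hypothetical caller-supplied function standing in for one profile-based search round; it is not a real BLAST API. It receives the query and the set of hits already included in the profile, and returns a mapping of hit identifier to E value.

```python
def iterate_until_convergence(search, query, inclusion_threshold,
                              max_iterations=10):
    """Repeat searches until no new hit passes the inclusion threshold.

    Returns (included_hits, iterations_run, converged).
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = search(query, frozenset(included))
        passing = {hit for hit, e in hits.items() if e <= inclusion_threshold}
        new_hits = passing - included
        if not new_hits:
            return included, iteration, True   # convergence reached
        included |= new_hits                   # these refine the next profile
    return included, max_iterations, False


if __name__ == "__main__":
    def fake_search(query, included):
        # Toy stand-in: each round "discovers" one more distant relative.
        hits = {"a": 1e-10}
        if "a" in included:
            hits["b"] = 1e-5
        if "b" in included:
            hits["c"] = 0.5   # never passes the threshold used below
        return hits

    print(iterate_until_convergence(fake_search, "MKV", 0.001))
```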
Motif searching with PHI-BLAST
A new service called Pattern Hit Initiated BLAST (PHI-BLAST), which searches for particular patterns in protein queries, is now available in Version 2.0 of the BLAST program suite.
PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence. PHI-BLAST searches the specified database for other protein sequences that also contain the input pattern and have significant similarity to the query sequence in the vicinity of the pattern occurrences.
PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used to initiate one or more rounds of PSI-BLAST searching.
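The pattern-matching half of a PHI-BLAST search can be illustrated by converting a pattern to a regular expression and locating its occurrences. The sketch assumes a PROSITE-style pattern syntax and supports only a small subset of it; the function names are hypothetical, and the similarity scoring around each occurrence is not shown.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regex string.

    Supported elements, separated by '-': a residue letter, x (any
    residue), [ABC] (one of), {ABC} (none of), with optional (n) or (n,m).
    """
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError("empty pattern element")
        core, low, high = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_pattern_positions(sequence, pattern):
    """Return 0-based start positions of all (overlapping) occurrences."""
    lookahead = "(?=(" + prosite_to_regex(pattern) + "))"
    return [m.start() for m in re.finditer(lookahead, sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("[AG]-x(4)-G-K-[ST]"))
    print(find_pattern_positions("MGAAAAGKSL", "[AG]-x(4)-G-K-[ST]"))
```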