How BLAST works

What to know before analyzing BLAST results

BLAST

A sequence comparison algorithm used to search databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in an attempt to generate an alignment with a score exceeding the threshold "S". The "T" parameter dictates the speed and sensitivity of the search.

query

The input sequence (or other type of search term) to which all of the entries in a database are to be compared.

Subject

The output sequences that BLAST found after searching the database.

Algorithm

A fixed procedure embodied in a computer program.

Alignment

The process or result of matching up the nucleotide or amino acid residues of two or more biological sequences to achieve maximal levels of identity and, in the case of amino acid sequences, conservation, in order to assess the degree of similarity and the possibility of homology.

Bit score

The bit score, S', is derived from the raw alignment score, S, taking the statistical properties of the scoring system into account. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
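As a rough illustration of this normalization, the sketch below applies the standard Karlin-Altschul conversion S' = (lambda × S - ln K) / ln 2. The function name and the lambda and K values in the example are illustrative assumptions, not parameters taken from any particular BLAST run.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'.

    Uses S' = (lambda * S - ln K) / ln 2.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Hypothetical parameters, chosen only for illustration.
example_bits = bit_score(raw_score=100, lam=0.3, k=0.1)
print(round(example_bits, 1))
```

Because lambda and K belong to the scoring system, two searches run with different matrices can give different raw scores for alignments of comparable quality, while their bit scores remain comparable.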

E value

The Expectation value or Expect value represents the number of different alignments with scores equivalent to or better than S that is expected to occur in a database search by chance.

The lower the E value, the more significant the score and the alignment.

gap

A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another.

H

H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990).

A measure of the average information (in bits) available per position that distinguishes an alignment from chance.

At high values of H, short alignments can be distinguished from chance, whereas at lower H values a longer alignment may be necessary (Altschul, 1991).

similarity

The extent to which nucleotide or protein sequences are related.

Similarity between two sequences can be expressed as percent sequence identity and/or percent positive substitutions.

identity

The extent to which two (nucleotide or amino acid) sequences have the same residues at the same positions in an alignment, often expressed as a percentage.

K

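A statistical parameter that can be thought of as a natural scale for search space size. It is used in converting a raw score (S) to a bit score (S').

lambda

A statistical parameter that can be thought of as a natural scale for the scoring system. It is used in converting a raw score (S) to a bit score (S').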

p value

The probability of a chance alignment occurring with a particular score or a better score in a database search.

The most highly significant P values will be those close to 0.

P values and E values are different ways of representing the significance of the alignment.
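The two measures are linked by a simple relation when the number of chance alignments is assumed to follow a Poisson distribution (the usual assumption in BLAST statistics): P = 1 - e^(-E). The sketch below uses a hypothetical helper name to show that the two values are nearly identical when E is small and diverge as E grows.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance alignment, given E expected ones."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision for very small E.
    return -math.expm1(-e_value)


for e in (1e-50, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```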

What is the Expect (E) value?

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size.

It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise.

For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. The lower the E-value, or the closer it is to zero, the more "significant" the match is.

You can change the Expect value threshold on most BLAST search pages.

When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
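To make the dependence on score and database size concrete, here is a minimal sketch that uses the standard relation E = m × n × 2^(-S'), where S' is the bit score and m and n are the query and database lengths. Real BLAST uses effective (edge-corrected) lengths; the plain lengths, the function names, and the hit names below are simplifying assumptions.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value from a bit score and the search space size m * n."""
    return query_len * db_len * 2.0 ** (-bit_score)


def filter_hits(hits, threshold=10.0):
    """Keep (name, e_value) pairs whose E value does not exceed the threshold."""
    return [hit for hit in hits if hit[1] <= threshold]


# Hypothetical hits: (name, bit score) for a 100-residue query
# searched against a database of 1,000,000 residues.
scored = [("hit_a", 60.0), ("hit_b", 25.0), ("hit_c", 20.0)]
hits = [(name, expect_value(bits, 100, 1_000_000)) for name, bits in scored]

print(filter_hits(hits))                  # default threshold of 10
print(filter_hits(hits, threshold=100))   # a looser threshold reports more hits
```

Each additional bit of score halves E, which is the exponential decrease described above.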

HSP

A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search.

low complexity region

A region of biased composition in nucleic acid and protein sequences.

These include homopolymeric runs, short-period repeats, and subtler overrepresentation of one or a few residues.

The SEG program is used to mask or filter low complexity regions in amino acid queries.

The DUST program is used to mask or filter such regions in nucleic acid queries.

masking

Also known as filtering.

The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence.

What is "low-complexity" sequence?

Regions with low-complexity sequence have an unusual composition that can create problems in sequence similarity searching.

For amino acid queries this compositional bias is determined by the SEG program (Wootton and Federhen, 1996).

For nucleotide queries it is determined by the DustMasker program (Morgulis et al., 2006).

Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT.

Filters are used to remove low-complexity sequence because it can cause artifactual hits.

In BLAST searches performed without a filter, high scoring hits may be reported only because of the presence of a low-complexity region.

Most often, it is inappropriate to consider this type of match as the result of shared homology.

Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
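The idea of compositional bias can be sketched with a sliding-window Shannon entropy measure. This is not the SEG or DUST algorithm; it is a simplified stand-in, and the window size and entropy cutoff are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, max_entropy: float = 2.0):
    """Return start positions of windows whose entropy is at or below the cutoff."""
    return [
        start
        for start in range(len(seq) - window + 1)
        if shannon_entropy(seq[start:start + window]) <= max_entropy
    ]


# The nucleotide example from the text is dominated by a single letter.
print(low_complexity_windows("AAATAAAAAAAATAAAAAAT", window=12, max_entropy=1.0))
```

A window drawn from a balanced sequence has entropy close to the maximum for its alphabet (2 bits for DNA), whereas a homopolymeric run scores near 0.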

local alignment

The alignment of a high-scoring region of two nucleic acid or protein sequences.

optimal alignment

An alignment of two sequences with the highest possible score.

raw score

The score of an alignment, S, calculated as the sum of substitution and gap scores.

Substitution scores are given by a look-up table (for example PAM or BLOSUM).

Gap scores are typically calculated as the sum of G, the gap opening penalty, and L, the gap extension penalty.

For a gap of length n, the gap cost would be G+Ln.

For example, consider the following alignment:

AACGTTTCCAGTCCAAATAGCTAGGC
***--***   *-***-**-******
AACCGTTC   TACAATTACCTAGGC

* = positive match

- = mismatch

blank = gap between the sequences

Score for each positive match = 1; total number of positive matches = 18

Penalty for each mismatch = 2; total number of mismatches = 5

Penalty for each gap = 2 (the gap penalty is adjustable); total number of gaps = 3

Score = (number of positive matches × score per match) - {(number of mismatches × penalty per mismatch) + (number of gaps × penalty per gap)} = (18 × 1) - {(5 × 2) + (3 × 2)} = 2
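The arithmetic of this example can be checked with a short sketch. It assumes the two sequences are already aligned and that gap positions in the second sequence are written as "-" (a convention of this sketch only); the function name and default penalties are taken from the example values above rather than from any BLAST default.

```python
def simple_alignment_score(top, bottom, match=1, mismatch=2, gap=2, gap_char="-"):
    """Count matches, mismatches and gaps column by column and score them."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * match - (mismatches * mismatch + gaps * gap)
    return matches, mismatches, gaps, score


top = "AACGTTTCCAGTCCAAATAGCTAGGC"
bottom = "AACCGTTC---TACAATTACCTAGGC"
print(simple_alignment_score(top, bottom))  # (18, 5, 3, 2)
```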

substitution

The presence of a non-identical amino acid at a given position in an alignment.

If the aligned residues have similar physico-chemical properties, or have a positive score in the governing scoring matrix, the substitution is said to be conservative.

substitution scoring matrix

A scoring matrix containing values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.

Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of protein sequences.

If the sample is large enough, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution.

The BLOSUM matrices are examples of substitution scoring matrices.

PAM

Percent Accepted Mutation (PAM) is a unit introduced by Margaret Dayhoff and colleagues to quantify the amount of evolutionary change in a protein sequence.

1.0 PAM unit is the amount of evolution that will change, on average, 1% of amino acids in a protein sequence.

A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence.

unitary matrix

Also known as an identity matrix. This is a scoring system in which only identical characters receive a positive score.

BLOSUM

A Blocks Substitution Matrix is a substitution scoring matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.

Each matrix is tailored to a particular evolutionary distance.

In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity.

Sequences more than 62% identical are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members (Henikoff and Henikoff, 1992).

profile

A table that lists the frequencies of each amino acid in each position of a protein sequence alignment. Frequencies are calculated from multiple alignments of sequences containing a domain of interest.

PSSM

A Position-Specific Scoring Matrix (PSSM) is a profile that gives the log-odds score for finding a particular matching amino acid in a target sequence.
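The relation between a profile and a PSSM can be sketched as follows. The toy alignment, the uniform background frequency of 1/20, and the pseudocount of 1 are all assumptions made for illustration; tools such as PSI-BLAST use more elaborate sequence weighting and background models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment):
    """Per-column amino acid frequencies for equal-length aligned sequences."""
    n_seqs = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: counts.get(aa, 0) / n_seqs for aa in AMINO_ACIDS})
    return profile


def profile_to_pssm(profile, n_seqs, pseudocount=1.0, background=1 / 20):
    """Log-odds (base 2) of smoothed column frequencies versus background."""
    pssm = []
    for freqs in profile:
        row = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts avoid log(0) for residues never seen in a column.
            smoothed = (freqs[aa] * n_seqs + pseudocount) / (
                n_seqs + pseudocount * len(AMINO_ACIDS)
            )
            row[aa] = math.log2(smoothed / background)
        pssm.append(row)
    return pssm


alignment = ["GLKFA", "GLRFA", "GIKFS"]  # hypothetical toy alignment
profile = build_profile(alignment)
pssm = profile_to_pssm(profile, n_seqs=len(alignment))
print(profile[2]["K"], round(pssm[0]["G"], 2), round(pssm[0]["W"], 2))
```

Residues that are enriched in a column relative to background receive positive scores, and residues that are absent receive negative ones.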

How BLAST works

Using a heuristic method, BLAST finds similar sequences, not by comparing either sequence in its entirety, but rather by locating short matches between the two sequences.

This process of finding initial words is called seeding.

While attempting to find similarity in sequences, sets of common letters, known as words, are very important.

For example, suppose that the sequence contains the following stretch of letters: GLKFA. If a BLASTP search were run under default conditions, the word size would be 3 letters.

In this case, using the given stretch of letters, the searched words would be GLK, LKF, KFA.
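The word list for this example can be produced with a simple sliding window. The function name is hypothetical; the word size of 3 follows the default stated above.

```python
def query_words(sequence: str, word_size: int = 3):
    """Return every overlapping word of the given size, in order."""
    return [
        sequence[i:i + word_size]
        for i in range(len(sequence) - word_size + 1)
    ]


print(query_words("GLKFA"))  # ['GLK', 'LKF', 'KFA']
```

A sequence of length n yields n - k + 1 words of length k, so the list grows linearly with the query.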

An overview of the BLASTP algorithm (a protein to protein search)

Remove low-complexity region or sequence repeats in the query sequence.

"Low-complexity region" means a region of a sequence composed of few kinds of elements.

These regions might give high scores that prevent the program from finding the truly significant sequences in the database, so they should be filtered out.

The regions will be marked with an X (protein sequences) or N (nucleic acid sequences) and then be ignored by the BLAST program.

To filter out the low-complexity regions, the SEG program is used for protein sequences and the program DUST is used for DNA sequences.

On the other hand, the program XNU is used to mask off the tandem repeats in protein sequences.

Make a k-letter word list of the query sequence.

Taking k=3 as an example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) sequentially, until the last letter of the query sequence is included. The method is illustrated in figure 1.

List the possible matching words.

BLAST only cares about the high-scoring words.

The scores are created by comparing the word illustrated previously with all the 3-letter words.

By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word.

For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively.

For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3.

After that, a neighborhood word score threshold T is used to reduce the number of possible matching words.

The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded.

For example, PEG is kept, but PQA is abandoned when T is 13.
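The PQG example can be reproduced with a small excerpt of the BLOSUM62 matrix. Only the five residue pairs needed here are included, and only two candidate words are scored; a full implementation would score all 20^3 candidates against the complete matrix. The function names are hypothetical.

```python
# Excerpt of BLOSUM62 covering only the residue pairs used in this example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("Q", "E"): 2,
    ("G", "G"): 6,
    ("G", "A"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def word_score(query_word: str, candidate: str) -> int:
    """Sum the substitution scores of the aligned residue pairs."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep candidate words scoring above the threshold T."""
    return [w for w in candidates if word_score(query_word, w) > threshold]


print(word_score("PQG", "PEG"), word_score("PQG", "PQA"))  # 15 12
print(neighborhood("PQG", ["PEG", "PQA"], threshold=13))   # ['PEG']
```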

Organize the remaining high-scoring words into an efficient search tree.

This allows the program to rapidly compare the high-scoring words to the database sequences.

Repeat the two preceding steps for each k-letter word in the query sequence.

Scan the database sequences for exact matches with the remaining high-scoring words.

The BLAST program scans the database sequences for the remaining high-scoring words, such as PEG, at each position.

If an exact match is found, this match is used to seed a possible un-gapped alignment between the query and database sequences.
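A minimal sketch of the lookup and scan steps follows. It uses a Python dictionary as a stand-in for the search tree mentioned above (an assumption of this sketch, not a description of BLAST's internal data structure), and the query words, subject sequence, and function names are hypothetical.

```python
from collections import defaultdict


def build_word_index(words_by_position):
    """Map each high-scoring word to the query positions it came from."""
    index = defaultdict(list)
    for q_pos, words in words_by_position.items():
        for word in words:
            index[word].append(q_pos)
    return index


def find_seeds(index, subject: str, word_size: int = 3):
    """Scan a database sequence and report (query_pos, subject_pos, word) seeds."""
    seeds = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        for q_pos in index.get(word, []):
            seeds.append((q_pos, s_pos, word))
    return seeds


# Query word PQG at position 0, with PEG retained as a high-scoring neighbor.
index = build_word_index({0: ["PQG", "PEG"]})
print(find_seeds(index, "MKPEGLV"))  # [(0, 2, 'PEG')]
```

Each database position costs one lookup regardless of how many words are in the index, which is what makes the scan fast.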

Extend the exact matches to high-scoring segment pair (HSP).

The original version of BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred.

The extension does not stop until the accumulated total score of the HSP begins to decrease. A simplified example is presented in figure 2.
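The extension rule can be sketched for a DNA seed using the +5/-4 match and mismatch scores mentioned earlier. With `x_drop=0` the sketch stops as soon as the running score falls below the best score seen, as described above; a larger `x_drop` that tolerates temporary dips is an added assumption of this sketch. The sequences, the word size of 3, and the function names are hypothetical.

```python
def dna_score(a: str, b: str) -> int:
    """+5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def extend_seed(query, subject, q_pos, s_pos, word_size, score=dna_score, x_drop=0):
    """Extend an exact word match left and right without gaps.

    Returns (query_start, query_end, subject_start, hsp_score), end exclusive.
    """
    seed_score = sum(
        score(query[q_pos + i], subject[s_pos + i]) for i in range(word_size)
    )

    def extend(q, s, step):
        """Walk in one direction; return (best gain, number of pairs kept)."""
        running = best = 0
        length = kept = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, kept = running, length
            elif best - running > x_drop:
                break  # the score has dropped too far below the best seen
            q += step
            s += step
        return best, kept

    right_gain, right_len = extend(q_pos + word_size, s_pos + word_size, +1)
    left_gain, left_len = extend(q_pos - 1, s_pos - 1, -1)
    total = seed_score + left_gain + right_gain
    return q_pos - left_len, q_pos + word_size + right_len, s_pos - left_len, total


query = "GGACGTACCT"
subject = "TTACGTACGA"
print(extend_seed(query, subject, q_pos=3, s_pos=3, word_size=3))  # (2, 8, 2, 30)
```

Here the seed CGT grows into the HSP ACGTAC, six matches scoring 30, and stops at the first mismatch on either side.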

List all of the HSPs in the database whose score is high enough to be considered.

We list the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of the alignment scores modeled by comparing random sequences, a cutoff score S can be determined such that its value is large enough to guarantee the significance of the remaining HSPs.

Evaluate the significance of the HSP score.

BLAST next assesses the statistical significance of each HSP score by exploiting the Gumbel extreme value distribution (EVD).

In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation

$$p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$

where

$$\mu = \frac{\ln(\kappa\, m' n')}{\lambda}$$

The statistical parameters λ and κ are estimated by fitting the distribution of the un-gapped local alignment scores of the query sequence and many shuffled versions (global or local shuffling) of a database sequence to the Gumbel extreme value distribution.

Note that λ and κ depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). m' and n' are the effective lengths of the query and database sequences, respectively.

Make two or more HSP regions into a longer alignment.

Sometimes, we find two or more HSP regions in one database sequence that can be made into a longer alignment.

This provides additional evidence of the relation between the query and database sequence.

There are two methods, the Poisson method and the sum-of-scores method, to compare the significance of the newly combined HSP regions.

Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52, 45), respectively.

The Poisson method gives more significance to the set with the maximal lower score (45>40).

However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than 52+45 (97). The original BLAST uses the Poisson method; gapped BLAST and WU-BLAST use the sum-of-scores method.
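The two preferences can be expressed as a pair of one-line ranking rules. This sketch captures only the comparison criteria described above (largest minimum score versus largest total score), not the underlying significance statistics; the function names are hypothetical.

```python
def prefer_poisson(hsp_sets):
    """Pick the set whose lowest HSP score is highest."""
    return max(hsp_sets, key=min)


def prefer_sum_of_scores(hsp_sets):
    """Pick the set whose HSP scores add up to the largest total."""
    return max(hsp_sets, key=sum)


hsp_sets = [(65, 40), (52, 45)]
print(prefer_poisson(hsp_sets))        # (52, 45)
print(prefer_sum_of_scores(hsp_sets))  # (65, 40)
```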

Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.

The original BLAST only generates un-gapped alignments including the initially found HSPs individually, even when there is more than one HSP found in one database sequence.
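For reference, a gapped local alignment score in the Smith-Waterman style can be computed with a short dynamic-programming sketch. It returns only the best score, uses a linear gap penalty rather than separate opening and extension penalties, and the scoring values are assumptions chosen for illustration (the match and mismatch scores echo the DNA values given earlier).

```python
def smith_waterman_score(a: str, b: str, match=5, mismatch=-4, gap=-6) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = h[i - 1][j] + gap      # gap in b
            left = h[i][j - 1] + gap    # gap in a
            # Resetting to 0 lets an alignment start anywhere (local alignment).
            h[i][j] = max(0, diagonal, up, left)
            best = max(best, h[i][j])
    return best


print(smith_waterman_score("ACGTTACGT", "ACGTACGT"))
```

In the usage example, a single gap joins two exact ACGT matches into one alignment that scores higher than either un-gapped segment alone, which is the benefit of gapped alignment over reporting HSPs individually.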

Report every match whose expect score is lower than a threshold parameter E.
